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Abstract 

We study the Principal Component Analysis (PCA) problem in the distributed and streaming 
models of computation. Given a matrix A G a rank parameter k < rank(A), and an 

accuracy parameter 0 < e < 1, we want to output an m x fc orthonormal matrix U for which 

||A-UUTA|||<(l + e)-||A-Ad||, 

where A^ G jg j|jgg|. approximation to A. We show the following. 

1. In the arbitrary partition model of Karman et al. (COLT 2014), each of s machines holds 

a matrix A* and A = Each machine should output U. Karman et al. achieve 

0{skm/e) + poly(sfc/e) words (of 0(log(nm)) bits) communication. We obtain the im¬ 
proved bound of 0{skm) -|- poly(sfc/s) words, and show an optimal n{skm) lower bound 
up to low order terms. This resolves an open question for high precision PCA. A poly (e“^) 
dependence is known to be required, but we separate this dependence from m. 

2. We bypass the above lower bound when A is (ji-sparse in each column and each server 
receives a subset of columns. Here we obtain an 0{sk(p/e) -t- poly(sfc/s) word protocol. 
Our communication is independent of the matrix dimensions, and achieves the guarantee 
that each server, in addition to outputting U, outputs a subset of 0{k/e) columns of A 
containing a U in its span (that is, we solve distributed column subset selection). We show 
a matching il{sk(j)/s) lower bound for distributed column subset selection. Achieving our 
communication bound when A is sparse but not sparse in each column, is impossible. 

3. In the streaming model in which the columns arrive one at a time, an algorithm of Liberty 
(KDD, 2013) with an improved analysis by Ghashami and Phillips (SODA, 2014) shows 
0{km/e) "real numbers" of space is achievable in a single pass, which we first improve to 
an 0(fcm/s)-l-poly(fc/e) word space upper bound. This almost matches a known n{km/e) 
bit lower bound of Woodruff (NIPS, 2014). We show with two passes one can achieve 
0{km) + poly(fc/e) words of space and (up to the poly(fc/e) term and the distinction 
between words versus bits) this is optimal for any constant number of passes. 

4. In turnstile streams, in which we receive entries of A one at a time in an arbitrary or¬ 
der, we show how to obtain a factorization of a (1 -I- e)-approximate rank-fc matrix using 
0{{m + n)k£~^) words of space. This improves the 0{(m + n£~‘^)k£~‘^) bound of Clarkson 
and Woodruff (STOC 2009), and matches their + n)ke~^) word lower bound. 

Notably, our results do not depend on the condition number of A. 


1 Introduction 


In distributed-memory computing systems, such as, Hadoop [1] or Spark [3], Principal Compo¬ 
nent Analysis (PCA) and the related Singular Value Decomposition (SVD) of large matrices is 
becoming very challenging. Machine learning libraries implemented on top of such systems, for 
example mahout [2] or mllib [4], provide distributed PCA implementations since PCA is often 
used as a building block for a learning algorithm. PCA is useful for dimension reduction, noise 
removal, visualization, etc. In all of these implementations, the bottleneck is the communication; 
hence, the focus has been on minimizing the communication cost of the related algorithms, and 
not the computational cost, which is the bottleneck in more traditional batch systems. 

The data matrix corresponding to the dataset in hand, e.g., a term-document matrix represent¬ 
ing a text collection, or the Netflix matrix representing user's ratings for different movies, could 
be distributed in many different ways [46]. In this paper, we focus on the following so-called arbi¬ 
trary partition and column partition models. In the arbitrary partition model, a matrix A G 
is arbitrarily distributed among s machines. Specifically, A = J2i=i where Aj G 
by machine i. Unless otherwise stated, we always assume m < n. In the column partition model, 
a matrix A G ^g distributed column-wise among s < n machines: A = (Ai A 2 ... A,^) ; 

here, for i = 1 : s, Aj is an m x rcj column submatrix of A with Wi = n. Note that the col¬ 
umn partition model is a special case of the arbitrary partition model. Thus, it is desirable to 
prove upper bounds in the arbitrary partition model, and lower bounds in the column partition 
model. Both models have been adapted by traditional numerical linear algebra software libraries 
for distributed memory matrix computations [12]. 

A recent body of work [29, 10, 41, 31, 37, 13] (see Section 3 for a comparison) has focused 
on designing algorithms which minimize the communication needed for each machine to out¬ 
put an m X k matrix U for which an m x k orthonormal matrix U for which || A — UU^A III < 
(1 -|- e) • II A — Afclll, where A^ G jg |.|^g ]|^gg|. j-^nk-Zc approximation to A. Each machine 

should output the same matrix U which can then be used for downstream applications, such as 
clustering, where one first projects the data to a lower dimensional subspace (see, e.g., [10]). The 
protocol should succeed with large constant probability. The model is such that each of the s ma¬ 
chines communicates with one machine, called the "coordinator" (which we sometimes refer to as 
Server), but does not communicate with other machines. This is known as the coordinator model 
and one can simulate arbitrary point-to-point communication with a constant factor overhead in 
communication together with an additive 0(log(mn)) bit overhead per message (see, e.g., [45]). 

1.1 Our Results 

The best upper bound in the arbitrary partition model is 0{{skme~^) -|- s ■ poly(A:e“^)) words (of 
O(log(mn)) bits) communication [37], while the only known lower bound for these problems is in 
the arbitrary partition model and is Q.{skm) [37]. In high precision applications one may want to 
set e to be as small as possible and the leading order term of 0{skm£~^) is undesirable. A natural 
question is whether there is an 0{{skm) + s ■ poly(/ce“^)) word protocol. We note that there is a 
fairly simple known U(e“^) bit lower bound [54] (we present this in Lemma 75 for completeness), 
so although one carmot achieve an 0(log(l/e)) dependence in the communication as one would 
maybe expect given iterative algorithms for regression with rurming time 0{\og{l/e)) (see section 
2.6 of [55] for an overview), an 0{{skm) + s ■ poly(/ce“^)) would at least separate the dependence 
of e and m and allowing for much smaller e with the same amount of communication. 

There are many existing protocols using 0{skme~^) words of communication [29, 10, 41, 31, 
37,13], and the protocols work in very different ways: that of [29,10] is based on coresets, while 
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that of [41, 31] is based on adapting a streaming algorithm to the communication setting, while 
that of [37] is based on sketching, and that of [13] is based on alternating minimization and is 
useful only under some assumptions on the condition number (see Section 3 for a more detailed 
discussion of these protocols). It was thus natural to assume there should be a lower bound of 

Q.{skme~^). 

Instead, we obtain a new upper bound in the arbitrary partition model and a matching lower 
bound (up to lower order terms) in the column partition model. 

Theorem 1 (Restatement of Theorem 43 and Theorem 82). Suppose matrix A G /§ partitioned 

in the arbitrary-partition model (See Definition 14). For any 1 > e > 0, there is an algorithm which on 
termination leaves the same orthonormal matrix U G machine such that || A — UU^A III < 

(1 + e) • ||A — Afclll holds with arbitrarily large constant probability. Further, the algorithm runs in 
polynomial time with total communication complexity 0{skm + s ■ poly{ke~^)) words each containing 
0{log{smne~^)) bits. 

If matrix A G is distributed in the column-partition model (See Definition 15), then for any 

positive constant C < 0(poly(sA:m)), the algorithm which can leave a orthonormal matrix U G on 

each machine such that ||A — UU"''A||| < C ■ ||A — Afc||| holds with constant probability has at least 
D{skm\og{skm)) bits of communication. 

In some applications even an Ofskm) word protocol may be too costly, as m could be very 
large. One could hope to do for communication what recent work on input sparsity algorithms 
[22, 43, 44, 16, 25] has done for compufation, i.e., obtain protocols sensitive to the number of 
non-zero entries of A. Many mafrices are sparse, e.g., Nefflix provides a training data set of 
100,480,507 ratings fhaf 480,189 users give to 17,770 movies. If users correspond to columns 
in the matrix, then the average column sparsity is 200 non-zero elements (out of the 17, 770 
possible coordinates). 

Our second contribution is the first protocol depending on the number of non-zero entries of 
A in the column partition model. Our communication is independent of fhe matrix dimensions. 
Denote by f the maximum number of non-zero elements of a column in A. When we say that A 
is (^-sparse, we mean that every column of A has af most f non-zero elements and nnz(A) < f-n. 
Our protocol has the additional feature of leaving on each machine the same subset C oi 0{k/e) 
columns of A for which there exists an m x A: orthonormal matrix U in the column span of C for 
which II A — CC^A||| < || A — UU"''A||| < (1 -|- e) • || A — Afc|||. This is known as fhe column subset 
selection problem, which is useful since C may be sparse if A is, and also if can lead fo better data 
intepretability. To partially complement our protocol, we show our protocol is optimal for any 
profocol solving the column subset selection problem. We summarize these results as follows. 

Theorem 2 (Restafement of Theorem 63). Suppose a f-sparse matrix A G M™xn partitioned in 
the column-partition model (See Definition 15). For any 1 > e > 0, there is an algorithm which on 
termination leaves C G with c = 0{k/e) columns of A and an orthonormal matrix U on each 

machine such that || A — CC^A||| < ||A-UU^A|| 

F — (1 -|- e) • IIA — Afclll holds with arbitrarily large 
constant probability. Further, the algorithm runs in polynomial time with total communication complexity 
O [sk(j)e~^ -|- words each containing of 0{\og{smne~^)) bits. 

If a f-sparse matrix A G /g distributed in the column-partition model (See Definition 15), 

then for any positive constant C < O {poly {skf)), the algorithm which can leave a orthonormal matrix 
U G on each machine such that ||A — UU"''A||| < 67 • ||A — Afc||| holds with constant probability 
has at least Fl{sk(j)log{sk(j))) bits of communication. 

Remark 3. Our protocol has no dependence on condition number and works for arbitrary matrices. INe 
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Upper bounds 

Lower bounds 

Definition 16 (arbitrary partition model) 

0(skm + s ■ poly(fc/e)) (Theorem43) 

Q(skm) (Theorem 1.2 in [37]) 

Definition 17 (column partition model) 

0(skm + s ■ poly(fc/e)) (Theorem43) 

Q(skm) (Theorem 82) 

Definition 17 with sparsity </> = o(e ■ m) 

0(sk4>e~^ + s ■ poly(fc/e)) (Theorem63) 

r2(sfc</)) (Corollary 83) 


Table 1: Communication upper/lower bounds for the Distributed PCA problems. 


make a serious effort as shown in Section 6 to achieve this. Note that even if the entries of A are specified 
using 0{\og{mn)) bits, the condition number can be 2~^km-'ri\og{rnn)) 

Remark 4. We did not discuss the running time of the algorithms, but we are interested in the fastest 
possible algorithms with the minimal communication cost, and provide fast algorithms as well. 

Remark 5. The hard instance in Section 10.2.2 implies an Tl{skm) lower bound even if the input matrix 
is sparse overall but has skm non-zero positions in arbitrarily locations. Therefore, to obtain our bounds we 
discuss the sparsity on each column instead of the overall sparsity of the matrix. 

A model closely related to the distributed model of computation is the streaming model of 
computation. The model we focused on is the so-called turnstile streaming model. In this model, 
there is a stream of update operations and each operation indicates that the corresponding entry of 
A should be incremented by a specific number. We present novel PCA algorithms in this model. 

Our first one-pass algorithm improves upon the best existing streaming PCA algorithm [41,31] 
in two respects. First, the space of [41, 31] is described in "real numbers" while our space bound 
{0{mkle + poly(A:/e)) - see Theorem 52) is in terms of words (we also bound the word size). This 
matches an Tl{km/e) bit lower bound for one-pass algorithms in [53], up to the distinction between 
words versus bits and a low order term poly(A:/e). Second, our algorithm can be applied in the 
turnstile streaming model which is stronger than column update streaming model in [41, 31]. 

Theorem 6 (Restatement of Theorem 52). Suppose A G /s given by a stream of update operations 

in the turnstile streaming model (See Definition 20). For any 1 > e > 0, there is an algorithm which 
uses a single pass over the stream and on termination outputs an orthonormal matrix U G 
that ||A — UU"''A||| < (1 -|- e) • ||A — Afc||| holds with arbitrarily large constant probability. Further, 
the algorithm runs in polynomial time with space of total O [mk/e -|- poly(A:e“^)) words each containing 
0{log{smne~^)) bits. 

A slight modification of the previous algorithm leads to a one-pass algorithm which can com¬ 
pute a factorization of a (1 -|- e)-approximate rank-/c matrix. The modified algorithm only needs 
0((n -\- m)k/£ + polyfk/e)) words of space which improves the 0{(m + ne~‘^)k£~‘^) upper bound 
in [21] and matches the Q{{n + m)kje) bit lower bound given by [21], up to the low order term 

poly(/c/e). 

Theorem 7 (Restatement of Theorem 53). Suppose A G ^g j^y ^ stream of update operations 

in the turnstile streaming model (See Definition 20). For any 1 > e > 0, there is an algorithm which uses 
a single pass over the stream and on termination outputs a matrix A^ G rank(Af) < k such 

that II A — A^lll < (1 -|- e) • ||A — Afclll holds with arbitrarily large constant probability. Further, the 
algorithm runs in polynomial time with space of total 0({m + n)k/£ + -po\y(ke~^)) words each containing 
0{log(smne~^)) bits. 
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Upper bounds 

Lower bounds 

One-pass turnstile, Def. 21 

0(mke ^ poly(A:, £ ^)) 

(Theorem 52) 

Q{mk£ ^)bits 

[53] 

One-pass turnstile, factorization, Def. 22 

0((n + 7n)ks~^ + poly(fc, e“^)) 

(Theorem 53) 

Q((n + m)k£~^) words 

[21] 

Two-pass turnstile, Def. 21 

0(mk + poly(fc, £~^)) 

(Theorem 55) 

Q(mk) bits 

[53] 


Table 2: Space upper/lower bounds for the Streaming PCA problem. 


We also show a two-pass algorithm which is an implementation of our distributed PCA proto¬ 
col. It uses 0((A:m) -|-poly(/c/e)) words of space, which up fo the poly(A:/e) term and the distinction 
between words versus bits, is optimal for any constant number of passes. A "next natural goal" 
in [53] was to improve the lower bound of Q.{km) to a bound closer to the 1-pass VL{km/£) lower 
bound established in that paper; our upper bound shows this is not possible. 

Theorem 8 (Restatement of Theorem 55). Suppose A G /s gwen by a stream of update operations 
in the turnstile streaming model (See Definition 20). For any 1 > e > 0, there is an algorithm which uses 
two passes over the stream and on termination outputs an orthonormal matrix U G _ 

UU"^A||| < (1 -|- e) • IIA — Afclll holds with arbitrarily large constant probability. Further, the algorithm 
runs in polynomial time with total space O {mk + poly(A:e“^)) words each containing 0{log{smn£~^)) 
bits. 


1.2 Technical Overview 
1.2.1 Upper Bounds 

The Arbitrary Partition Model. Recall that the input matrix A G jg arbitrarily partitioned 

into s matrices A* G i.e., for i = 1,2, ... , s: A = hy sketching on the left and right 

by two small random sign matrices S G ^o(k/e^)xm ^ -jg^nxoik/e'^)^ with 0{sk‘^/e^) words of 

communication, the server can learn A = SAT which is a "sketch" of A. It suffices to compute 
the best rank-A: approximation A^ to A. Suppose the SVD of A^ = Then the server 

can learn ATV^^ by the second round of communication which needs 0(skm) words. We then 
prove (see Lemma 28): ||A — UU^A||| < (l-|-e)||A — Afc||| where U is an orthonormal basis of the 
column space of ATV^^. Notice that rankilJ) < k. Thus U is exactly what we want. Section 5.3 
shows the details. 

A technical obstacle in the analysis above is that it may require a large number of machine 
words to specify the entries of even if each entry in A is only a single word. In fact, one can 
show (we omit the details) that the singular values of A^ can be exponentially large in /c/e, which 
means one would need to round the entries of Vto an additive exponentially small precision, 
which would translate to an undesirable sm ■ poly (/c/e) bits of communication. 

To counter this, we use the smoothed analysis of Tao and Vu [52] to argue that if we add small 
random Bernoulli noise to A, then its minimum singular value becomes inverse polynomial in 
n. Moreover, if the rank of A is at least 2k and under the assumption that the entries of A are 
representable with a single machine word (so we can identify them with integers in magnitude 
at most poly(nms/e) by scaling), then we can show the additional noise preserves relative error 
approximation. To our knowledge, this is the first application of this smoothing technique to 
distributed linear algebra algorithms. Section 6 discusses the details of this approach. 
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On the other hand, if the rank of A is smaller than 2k, the smoothing technique need not pre¬ 
serve relative error. In this case though, we can learn a basis C G for the column span 

of A by multiplying by a pseudorandom matrix based on Vandermonde matrices. Since C has at 
most 2k columns, we can efficiently communicate it to all servers using 0{skm) communication. 
At this point we set up the optimization problem minj.ank(x)<fc || CXC^ A — A||f. The server cannot 
learn C^A and A directly, as this would be VL(nm) words, but we can choose additional "sketch¬ 
ing matrices" Tieft and Tright to instead solve mini.ank(x)<fc ||Tie/t(CXC^A - A)Tright\\F, which 
is a generalization of the subspace embedding technique to "affine embeddings" [53]. the server 
learns T le/tC^C^ AT right and TieftATright, which are small matrices, and can be sent to each ma¬ 
chine. Each machine locally solves for the rank-/c solution X* minimizing \\T leftCX.C'^ AT right — 
T le ft AT rightlW- Now each machine can find the same m x k orthonormal basis and output it 
without further communication. 

Sparse Matrices in the Column Partition Model. Our algorithm for sparse matrices is techni¬ 
cally simpler than that in the arbitrary partition model. Section 8.2 presents a distributed PCA 
algorithm that has 0{{4>kse~^) + poly(s/c/e)) words of communication cost. Our idea in order to 
take advantage of the sparsity in the input matrix A is to select and transfer to the coordinator 
a small set of "good" columns from each sub-matrix A,. Specifically, we design an algorithm 
that first computes, in a distributed way, a matrix C G with c = 0{ke~^) columns of A, and 
then finds U G span{C) using a distributed, communication-efficient algorithm developed in [37]. 
The matrix C is constructed using optimal algorithms for column sampling [17, 26, 24], extended 
properly to the distributed case. 

As mentioned, our algorithm actually solves the distributed column subset selection problem. 
It builds on results for column subset selection in the batch setting [17, 26, 24], but gives a novel 
analysis showing that in the distributed setting one can find columns as good as in the batch 
setting. To the best of our knowledge, only heuristics for distributed column subset selection were 
known [28]. We also optimize the time complexity of our algorithm, see Table 4. 

Turnstile streaming model. Our one-pass streaming PCA algorithm starts by generating two 
random sign matrices S G and R G gj-^own in Lemma 51, 

min IIARXSA - A||| <{l + e)- || A - Afc||| 
rank(x)<fc 

The intuition of the above is as the following. Suppose the SVD of A^ is UAi,XAj.V^^ where 
Afc is the best rank-Zc approximation matrix to A. Since X can be chosen as Saj,Vaj,, we have 
min,.a,jfc(x)<A: llUAfcX — A||| = ||Afc — A|||. As shown in [21], S can provide a sketch for the 
regression problem. Thus, ||Uaj,X — A||| < (1 -I- e) • FQm.rank{'x.)<k IIUa^X — A||| where 

X = arg min ||SUai,X — SAIIp 

rank(X)<k 

Notice that X is in the row space of span{SA). Then we focus on the regression problem: 

min ||XSA-A|||. 
rank(x)<fc 

Similarly, we can use R to sketch the problem. Thus finally we only need to optimize 

min IIARXSA-A|||. 
rank(x)<fc 
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Upper bounds 

Lower bounds 

Definition 18 

0(sk4>e 

(Theorem 74) 

Q(sk4>e 

(Theorem 90) 

Definition 19 

O ^sk4>e~^ + s ■ poly(fc/e)^ 

(Theorem 74) 

Q(sk4>e~^) 

(Corollary 91) 


Table 3: Communication upper/lower bounds for the CSSP problems of Definitions 18 and 19. 


However, it is too expensive to store the entire matrix A. Thus, we sketch on the left and right 
by Tieft and Tright- Since all of T^e/tAR, SAT,,ig,,t and TieftATright can be maintained in small 
space, X* = ||T/e/t(ARXSA—A)T,.ig/,t||| can be constructed after one pass over 

the stream. Additionally, if AR is also maintained in the algorithm, we can get U G ^ of which 

columns are an orthonormal basis of span (ARX*) satisfying 11A — UU^ A||2 <(l + ^).||A-Afc|||. 
Furthermore, if SA is also maintained, since it suffices to compute the SVD of X* = Ux* Xx* 

T = T^e/tARUx* and K = Vx^SAT,.jg/,t can be constructed in 0((n + m)k) words of space. 
Therefore, the algorithm can compute A^ = TXx*K = ARX*SA in 0((n + m)h.) words of 
space. The algorithm then can output A^ with total space 0{{n + m)k/e + poly(/c/e)) words (see 
Theorem 53). 

Our two-pass streaming PCA algorithm is just an implemententation of our distributed PCA 
algorithm in the streaming model. 

1.2.2 Lower Bounds 

Table 3 summarizes the matching communication lower bounds that we obtain for disfributed 
column-based mafrix reconstruction. Theorem 90 proves a lower bound for the problem in Defi¬ 
nition 18; then, a lower bound for the problem of Definition 19 follows immediately since this is a 
harder problem (see Corollalry 91). 

Distributed Column Subset Selection. Our most involved lower bound argument is in showing 
the tightness for the distributed column subset selection problem, and is given in Theorem 90. To 
illustrate the ideas, suppose k = 1. We start with the canonical hard matrix for column subset 
selection, namely, anm x m matrix A whose first row is all ones, and remaining rows are a subset 
of fhe identity matrix. Intuitively the best rank-1 approximation is very aligned with the first row 
of A. However, given only o(l/e) columns of the matrix, there is not a vector in the span which 
puts a (1 — e)-fraction of its mass on the first coordinate, so n(l/e) columns are needed. Such a 
matrix has (j) = 2. 

Obtaining a lower bound for general 4> involves creating a large net of such instances. Suppose 
we set 171 = 4) consider LA, where L is formed by choosing a random 4^4 orthonormal 
matrix, and then rounding each entry to the nearest integer multiple of l/poly(n). We can show 
that o(l/e) columns of LA do not span a rank-1 approximation to LA. We also need a lemma 
which shows that if we take two independently random orthonormal matrices, and round their 
entries to integer multiples of 1/poly(n), then these matrices are extremely unlikely to share a 
constant fraction of columns. The idea then is that if one server holds LA, and every other server 
holds the zero matrix, then every server needs to output the same subset of columns of LA. By 
construction of A, each column of LA is the sum of the first column of L and an arbifrary column 
of L, and since we can assume all servers know the first column of L (which involves a negligible 
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0{s4>) amount of communication), this implies that each server must learn Q.{l/e) columns of L. 
The probability that random discretized orthonormal matrices share ff (1 /e) columns is sufficiently 
small that we can choose a large enough net for these columns to identify L in that net, requiring 
Q{(j)/e) bits of communication per server, or fl(s(/)/e) words in total. The argument for general k 
can be viewed as a "direct sum theorem", giving Q{sk(l)/s) words of communication in total. 

Dense Matrices in the Column Partition Model. Our optimal lower bound for dense matrices 
is technically simpler. It gives an Q{skm) word communication lower bound in column partition 
model for dense matrices. Theorem 82 argues that there exists an m x n matrix A such that for this 
A, any k < 0.99m, and any error parameter C with 1 < C < poly(s/cm), if there exists a protocol to 
construct anmxk matrix U satisfying || A — UU^A||| < C • || A — Afc|||, with constant probability, 
then this protocol has communication cost at least Q.{skm) words. This lower bound is proven for 
a matrix A that has k fully dense columns. We should note that the lower bound holds even if 
each machine only outputs the projection of Aj onto U, i.e., UU^Aj, rather than U itself. Note 
that this problem is potentially easier, since given U, a machine could find UU^ A*. 

The intuition of the proof is that machine 1 has a random matrix Ai G with orthonormal 
columns (rounded to integer multiples of l/poly(n)), while all other machines have a very small 
multiple of the identity matrix. When concatenating the columns of these matrices, the best k- 
dimensional subspace to project the columns onto must be very close to Ai in spectral norm. 
Since machine i, i > 1, has tiny identity matrices, after projection it obtains a projection matrix 
very close to the projection onto the column span of Ai. By choosing a net oi m x k orthonormal 
matrices, all of which pairwise have high distance in spectral norm, each machine can reconstruct 
a random element in this net, which requires lots of information, and hence communication. 

Our lower bound of Theorem 82 implies a lower bound of Q,{sk(j)) words for matrices in which 
the column sparsity is 4> (see Corollary 83) as a simple corollary. 

1.3 Road Map 

In the first ten pages, we have one additional section. Section 2, which gives an overview of our 
algorithm in the arbitrary partition model. 

In the Appendix, we then present more detail on prior results on distributed PCA algorithms 
in Section 3. Section 4 introduces the notation and basic results from linear algebra that we use 
in our algorithms. Section 5 presents our distributed PCA algorithm for arbitrary matrices for 
Definition 16. The communication of the algorithm in this section can be stated only in terms of 
"real numbers". We resolve this issue in Section 6 where a modified algorithm has communication 
cost bounded in terms of machine words. Section 7 discusses space-optimal PCA methods in 
the streaming model of computation. Section 8 presents a distributed PCA algorithm for sparse 
matrices in column partition model, while Section 9 extends this algorithm to a faster distributed 
PCA algorithm for sparse matrices. Section 10 presents communication lower bounds. 

2 Outline of our result in the arbitrary partition model 

Here we give a brief outline of our result in the arbitrary patition model, deferring the full details 
to the appendix. 

We describe a fast distributed PCA algorithm with total communication 0{msk) words plus 
low order terms, which is optimal in the arbitrary partition model in the sense that an Q{msk) bit 
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lower bound was given by [37]. The algorithm employs, in a novel way, the notion of projection- 
cost preserving sketches from [24], In particular, whereas all previous [48, 23] dimension-reduction- 
based SVD methods reduce one dimension of the input matrix to compute some approximation to 
the SVD, our method reduces both dimensions and computes an approximation to the SVD from a 
small almost square matrix (we note that other work, such as work on estimating eigenvalues [8], 
uses sketches in both dimensions, but it is not clear how to compute singular vectors using such 
sketches). Unlike [37] which reduces only one dimension in the first communication round, we 
do the reduction on both dimensions in the same round. 

We first present the batch version of the algorithm which offers a new low-rank matrix approx¬ 
imation technique; a specific implementation of this algorithm offers a communication-optimal 
distributed PCA algorithm (we also discuss in Section 7 a variant of this algorithm that offers 
a two-pass space-optimal PCA method in the turnstile streaming model). Before presenting all 
these new algorithms in detail, we present the relevant results from the previous literature that 
we employ in the analysis. 

2.1 Projection-cost preserving sketching matrices 

In this section, we recap a notion of sketching matrices which we call "projection-cost preserving 
sketching matrices". A sketching matrix from this family is a linear matrix transformation and it 
has the property that for all projections it preserves, up to some error, the difference between the 
matrix in hand and its projection in Frobenius norm. 

Definition 9 (Projection-cost preserving sketching matrices). We say that W G is an (e, k)- 
proiection-cost preserving sketching matrix of A G if for all rank-k orthogonal projection matrices 

p g n satisfies 

(1-e)||A-PA||| < ||AW-PAW||| + c< (1 + e)|| A - PA||| 

where c is a non-negative constant which only depends on A and W. We also call AW an (e, kj-projection- 
cost preserving sketch of A. 

Due to the following lemma, we know thaf a good rank-A: approximation projection matrix of 
(e, /c)-projection-cost preserving sketch AW also provides a good rank-A: approximation to A. 

Lemma 10 (PCA via Projection-Cost Preserving Sketches - Lemma 3 in [24]). Suppose W G is 
an (e, k)-projection-cost preserving sketching matrix of A G Let P = argmin,.a,j;.(P)<fc ||AW — 

PAW||2. For allP,e' satisfying rankjP) < k,e' > 0, if || AW-PAW||| < (l-Fe')ll AW-P*AW||2, 

IIA — PA Up < —- ■ (1 -|- e^)||A — A^Hp 

[24] also provides several ways to construct projection-cost preserving sketching matrices. Be¬ 
cause we mainly consider the communication, we just choose one which can reduce the dimension 
as much as possible. Furthermore, it is also an oblivious projection-cost preserving sketching ma¬ 
trix. 

Lemma 11 (Dense Johnson-Lindenstrauss matrix - part of Theorem 12 in [24]). For e < 1, suppose 
each entry of MV G is chosen 0{log{k))-wise independently and uniformly in —l/^/ff} where 

f = 0{ke~‘^) [21], For any A G R™^"', with probability at least 0.99, W is an {e,k)-projection-cost 
preserving sketching matrix of A. 
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2.2 A batch algorithm for the fast low rank approximation of matrices 

In this section, we describe a new method for quickly computing a low-rank approximation to 
a given matrix. This method does not offer any specific advantages over previous such tech¬ 
niques [48,23,18]; however, this new algorithm can be implemented efficiently in the distributed 
setting (see Section 5.3) and in the streaming model of computation (see Section 7); in fact we 
are able to obtain communication-optimal and space-optimal results, respectively. For complete¬ 
ness as well as ease of presentation, we first present and analyze the simple batch version of the 
algorithm. The algorithm uses the dense Johnson-Lindenstrauss matrix of Lemma 27 in order 
to reduce both dimensions of A, before computing some sort of SVD to a poly (/c/e) x poly (/c/e) 
matrix (see Step 2 in the algorithm below). 

Consider the usual inputs: a matrix A G a rank parameter k < rank(A), and an accu¬ 
racy parameter 0 < e < 1. The algorithm below returns an orthonormal matrix U G such 

that 

||A - UU^AIll < (1 + e) • ||A - Afclll. 


Algorithm 

1. Construct two dense Johnson-Lindenstrauss matrices S G T G with C = 0{ke~^),^2 = 

0{ke~'^) (see Lemma 27). 

2. Construct A = SAT. 

3. Compute the SVD of Afc (U^, GR'=^^V^^ G 

4. Construct X = ATV^^ 

5. Compute an orthonormal basis U G fQj. spanpC) (notice that rankiyi) < k). 

Theorem 29 later in this section analyzes the approximation error and the running time of the 
previous algorithm. First, we prove the accuracy of the algorithm. 

Lemma 12. The matrix U G k orthonormal columns satisfies with probability at least 0.98; 

||A - UU^AIll <(! + £)• ||A - Afclll. 


Proof. 


A-A,||2 = ||A-AV, V 


\l = IISAT-SATVi 


II = iit^aTs^-Vi vl tTa^s^III 


The first equality follows by the SVD of A^ = . The second equality is by the 

construction of A = SAT. The third equality is due to VM, ||M||| = ||M^|||. 

Due to Lemma 27, with probability at least 0.99, is an {e, /c)-projection-cost preserving 
sketch matrix of T^A^. According to Lemma 26, 

||AT - ATV^^Vjjll = IItV - ^ . ||AT - (AT)fc||| (1) 


Observe that 


IIUU'^AT-ATllI = IIXX^AT-ATllI < ||XVl - AT||| = ||ATVi - AT||| 

II lU II lU — II Afc Hr H Afc Afc lir 

which is at most ■ ||AT - (AT)fc|||. 
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The first equality uses the fact that U is an orthonormal basis of span(X). The first inequality is 
followed by VX,M,N, ||XX^M — M||| < ||XN —M|||. The second equality uses the construction 
that X = ATV^^. The second inequality follows by Eqn (2). 

Due to Lemma 27, with probability at least 0.99, T is an (e, fe)-projection-cost preserving sketch 
matrix of A. Due to Lemma 26, 

||UU-A-Ai<|^.||A-A.i 

Due to union bound, the probability that is an (e, /c)-projection-cost preserving sketch matrix 
of A^ and T is an (e, A:)-projecfion-cosf preserving sketch matrix of A is at least 0.98. Note that 
is 1 -|- 0(e) when e is small enough, so we can adjust e here by a constant factor to show the 
statement. ■ 

Next, we present the main theorem. 

Theorem 13. The matrix U G ivith k orthonormal columns satisfies with probability at least 0.98; 

IIA - UU^AIll <(!+£)• IIA - Afclll. 

The running time of the algorithm is 

O (nmke~‘^ -|- mk^£~^ + poly(te“^)) . 

Proof. The correctness is shown by Lemma 28. The rurming time is analyzed in Section 5. 


2.3 The distributed PCA algorithm 

Recall that the input matrix A G partitioned arbitrarily as: A = Ylt i = I : s, 

Ai G The idea in the algorithm below is to implement the algorithm in Section 5.2 in the 

distributed setting. 

Input: 

1. A G arbitrarily partitioned A = A^ for i = 1 : s, A^ G 

2. rank parameter k < rank(A) 

3. accuracy parameter e > 0 

Algorithm 

1. Machines agree upon two dense Johnson-Lindenstrauss matrices S G T G with = 

0(k£~'^),^2 = 0(ks~‘^) (see Lemma 27). 

2. Each machine locally computes Ai = SA^T and sends Ai to the server. Server constructs A = ^^ Ai. 

3. Server computes the SVD of A^ = (Ua, G Sa, G Va^ G 

4. Server sends Va^^ to all machines. 

5. Each machine construct X^ = AiTVA^, and sends to the server. Server constructs X = ^ - X^. 

6. Server computes an orthonormal basis U G for span(X) (notice that ranfc(X) < k). 

7. Server sends U to each machine. 

Notice that in the first step, S and T can be described using a random seed that is 0(log(A:))- 
wise independent due to Lemma 27. 

The major remaining challenge here is to obtain small bit complexity, since this protocol could 
have very large bit complexity. See Section 5 in the Appendix for the full details of our algorithm. 
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3 Related Work 


Distributed PCA (or distributed SVD) algorithms have been investigated for a long time. One line 
of work, developed primarily within the numerical linear algebra literature, studies such algo¬ 
rithms from the perspective of parallelizing existing standard SVD algorithms without sacrificing 
accuracy. This approach aims at high accuracy implementations with the least possible commu¬ 
nication cost. The distributed models of computation go typically beyond the column-partition 
model and arbitrary-partition model that we study in this paper [46]. An extensive survey of this 
line of work is out of the scope of our paper; we refer the reader to [35, 50, 33] and references 
therein for more details as well as to popular software for distributed SVD such ScaLAPACK [14] 
and Elemental [46]. 

Another line of work for distributed PCA algorithms has emerged within the machine learning 
and datamining communities. Such algorithms have been motivated by the need to apply SVD or 
PCA to extremely large matrices encoding enormous amounts of data. The algorithms in this line 
of research are typically heuristic approaches that work well in practice but come with no rigorous 
theoretical analysis. We refer the reader to [47,42, 9,15] for more details. 

Finally, distributed PCA algorithms in the column-partition model have been recently stud¬ 
ied within the theoretical computer science community. Perhaps the more intuitive algorithm 
for PCA in this model appeared in [29, 10]: first, a number of left singular vectors and singular 
values are computed in each machine; then, the server collects those singular vectors and con¬ 
catenates them column-wise in a new matrix and then it computes the top k left singular vectors 
of this "aggregate" matrix. It is shown in [29, 10] that if the number of singular vectors and sin¬ 
gular values in the first step is 0{ke~^), then, the approximation error in Frobenius norm is at 
most (1 -t- s) times the optimal Frobenius norm error; the communication cost is 0{skme~^) real 
numbers because each of the s machines sends 0{ke~^) singular vectors of dimension m; unfortu¬ 
nately, it is unclear how one can obtain a communication cost in terms of words/bits. A different 
algorithm with the same communication cost (only in terms of real numbers since a word/bit 
communication bound remained unexplored) is implicit in [41, 31] (see Theorem 3.1 in [31] and 
the discussion in Section 2.2 in [41]). Karman, Vempala and Woodruff proposed arbitrary-partition 
model in [37]. They developed a (1 -|- e) Frobenius norm error algorithm with communication cost 
O [skme~^ + sk‘^s~^) words. Bhojanapalli, Jain, and Sanghavi in [13] developed an algorithm 
that provides a bound with respect to the spectral norm. Their algorithm is based on sampling 
elements from the input matrix, but the communication cost is prohibitive if n is large. The cost 
also depends on the condition number of A. Moreover, to implement the algorithm, one needs to 
know IIA — Afcllp. Finally, [39] discussed a distributed implementation of the "orthogonal itera¬ 
tion" to compute eigenvectors of graphs. The model they consider is different, but perhaps their 
protocol could be extended to our model. 

4 Preliminaries 

Definitions. We formally define the problems and the models here. 

Definition 14 (Arbitrary-partition model). An mxn matrix A is arbitrarily partitioned into s matrices 
Ai G i.e.,for i = 1,2,... ,s: A = Yli=i There are s machines, and the i-th machine has Aj as 

input. There is also another machine, to which we refer to as the "server", which acts as the central coor¬ 
dinator. The model only allows communication between the machines and the server. The communication 
cost of an algorithm in this model is defined as the total number of words transferred between the machines 
and the server, where we assume each word is 0{log{nms/e)) bits. 
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Reference 

l|A 

- UU‘A||2 < 

||A-UU‘A||)(, < 

5 

Communication cost 

Total number of arithmetic operations 

Implicit in [29] 

- 

(1 + e) IIA — lip 

0 

- 

0 ^mn min{m, n} + mske~^ min{m, sfce” ^ j 

Theorem 2 in [10] 

- 

(1 + e) IIA — Afc lip 

0 

- 

0 1 mn min{Tn, n} + msk£~^ min{m, sfce“ ^ J 

Theorem 6 in [10] 

- 


> 0 

- 


Implicit in [41, 31] 

- 

(l-l-e)IIA-AfcllJ 

0 



Thm 1.1 in [37]”^ 

- 

(1 + e) IIA — Afc lip 

0(1) 


0{poly{m, n, k, s, 

Thm 5.1 in [13] 

IIA 

- Afclla + r 

- 

> 0 


0 /nnz(A) + 5~^nk^ £~^ ai(A.)a'^'‘^ (A.)\ 

Remark p. 11 [13] 

IIA 

- Afclla + r 

- 

> 0 


0 /nnz(A) + (5“^nfe^e“^cr^(A)(T^^(A)j 

Theorem 43* 

- 

(1 + e) IIA — Afc lip 

0(1) 

0{skm-\-sk^e 

0 1 mnfce~^ + 7nk‘^£~‘^ + poly{ke~ ^ ) ) ■ 

Theorem 63 

- 

(1 + e)||A — Afc Ip 

0(1) 

0{sk,j,e~^ + sk'^e~*) 

0 (mns • poly(k, 1/e)) 

Theorem 74 

- 

(1 + e) IIA — A^ lip 

0(1) 

0{sk4>e~^ sk^e~^) 

0 (nnz(A) ■ log^(l|S) + (m + n) ■ s ■ poly(^ log(l^)). 


Tabl6 4: Distributed PCA Algorithms in the column-partition model with s machines (see Definition 15). * indicates that the 
algorithm can also be applied in the arbitrary partition model (see Definition 14). A S has rank p, U S R’"^'= with fc < p is 

orthonormal, 0 < £ < 1, and each column of A contains at most (p < m non-zero elements. S is the failure probability. Finally, for 

notational convenience let F := £|| A — Aj.||f, A := ctj (A)(t^^(A) log^ ^|| A|| 2 || A — Aj.||p^£“^^. 


Definition 15 (Column-partition model). An m x n matrix A is partitioned arbitrarily column-wise 
into s blocks A* G ^ i.e.,for i = 1,2,..., s; A = (Ai A 2 ... A^) . Here, Y^Wi = n. There are 

s machines, and the i-th machine has A* as input. There is also another machine, to which we refer to as the 
"server”, which acts as the central coordinator. The model only allows communication between the machines 
and the server. The communication cost of an algorithm in this model is defined as the total number of words 
transferred between the machines and the server, where we assume each word is 0{log{nms/e)) bits. 

Definition 16 (The Distributed Principal Component Analysis Problem in arbitrary partition model). 
Give anmxn matrix A arbitrarily partitioned into s matrices A* G : 1 ^ 2,..., s); A = Yi=i 

a rank parameter k < rank(A), and an accuracy parameter 0 < e < 1, design an algorithm in the model 
of Definition 14 which, upon termination, leaves on each machine a matrix U G with k orthonormal 

columns such that || A — UU"''A||| < (1 + e) • || A — Afc|||, and the communication cost of the algorithm 
is as small as possible. 

Definition 17 (The Distributed Principal Component Analysis Problem in column partition model). 
Given an m x n matrix A partitioned column-wise into s arbitrary blocks A* G (i : 1,2,..., s); 

A = (Ai A 2 ... As) , a rank parameter k < rank(A), and an accuracy parameter 0 < e < 1, de¬ 
sign an algorithm in the model of Definition 15 which, upon termination, leaves on each machine a matrix 
U G with k orthonormal columns such that ||A — UU"^A||p < (1 -|- e) • || A — Afc|||, and the 

communication cost of the algorithm is as small as possible. 

Definition 18 (The Distributed Column Subset Selection Problem). Given an m x n matrix A par¬ 
titioned column-wise into s arbitrary blocks A* G (i : 1,2,..., s); A = (Ai A 2 ... As) , a 

rank parameter k < rank(A), and an accuracy parameter 0 < e < 1, design an algorithm in the model of 
Definition 15 that, upon termination, leaves on each machine a matrix C G with c < n columns of 
A such that || A — CC^A||| < (1 -I- e) • || A — Afc|||, and 

1. The number of selected columns c is as small as possible. 

2. The communication cost of the algorithm is as small as possible. 

Definition 19 (The Distributed Column Subset Selection Problem - rank k subspace version). 
Given an m x n matrix A partitioned column-wise into s arbitrary blocks A* G (i ; 1,2,..., s): 

A = (Ai A 2 ... As) , a rank parameter k < rank(A), and an accuracy parameter 0 < e < 1, 
design an algorithm in the model of Definition 15 that, upon termination, leaves on each machine a ma¬ 
trix C G with c < n columns of A and a matrix U G with k orthonormal columns with 

U G span{C), such that ||A — CC^A||p < ||A — UU"^A||| < (1 -)- e) • ||A — A^Hp, and 
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1. The number of selected columns c is as small as possible. 

2. The communication cost of the algorithm is as small as possible. 

Definition 20 (Streaming model for Principal Component Analysis). Let all the entries in A G 
initially be zeros. In the streaming model of computation, there is a stream of update operations that the 
gth operation has form {iq,jq,Xq) which indicates that Aj^ j^ should be incremented by Xq where iq G 
{1, ...,m},jq G {1, ...,n},Xq G M. An algorithm is allowed a single pass over the stream. At the end of 
the stream the algorithm stores some information regarding A which we call a "sketch" of A. The space 
complexity of an algorithm in this model is defined as the total number of words required to describe the 
information the algorithm stores during the stream including the sketch. Each word is O (log (nms/e)) 
bits. 

Definition 21 (The Streaming Principal Component Analysis Problem). Given an m x n matrix A, 
a rank parameter k < rank(A), and an accuracy parameter 0 < e < 1, design an algorithm that, using 
as little space as possible, first finds a sketch of A in the streaming model (see Definition 20) and then, 
using only this sketch outputs U G with k orthonormal columns such that IIA — UU"''A||p < 

(l + e)-||A-Afc||2. 

Definition 22 (The Streaming Principal Component Analysis Problem (factorization)). Given an 
m X n matrix A, a rank parameter k < rank(A), and an accuracy parameter 0 < e < 1, design an 
algorithm that, using as little space as possible, first finds a sketch of A in the streaming model (see Defini¬ 
tion 20) and then, using only this sketch outputs A^ G R™xn rank(Af) < k such that ||A —A|.||| < 
(1 + e) • II A — Afc|||. 

Notation. A, B,... are matrices; a, b,.. . are column vectors. In is the n x n identity matrix; 
Omxn is the m X n matrix of zeros; In is the n x 1 vector of ones; e* is the standard basis (whose 
dimensionality will be clear from the context): the ith element of e* is one and the rest are zeros. 
A*^*) and A(j) denotes the zth column and jth row of A, respectively. Ajj is the (i, j)th entry in A. 

Sampling Matrices. Let A = [A(i),...,A(")] g and let C = [A(*i), ..., A^*")] G 

consist of c < n columns of A. Note that we can write C = AD, where the sampling matrix is 
D = hi,..., ej^] G R”^'^ (here are the standard basis vectors in R”). If D G is a diagonal 
matrix, then ADD contains c columns of A rescaled with the corresponding elements in D. We 
abbreviate S := DD, hence the matrix S G R”^"^ "samples" and "rescales" c columns from A. 

Matrix norms. We use the Frobenius and the spectral matrix-norms: ||A||| = j Afy, ||A ||2 = 
max||x|| 2 =i II Ax|| 2 . || A||^ is used if a result holds for both norms ^ = 2 and ^ = F. The standard 
submultiplica tivity property for matrix norms implies that for any A and B: ||AB||^ < ||A||^-||B||^. 
The triangle inequality for matrix norms implies that ||A-|-B||^ < ||A||^-|- ||B|h A version of the 
triangle inequality for the norms squared is: ||A-|-B||| < 2 - ||A|||-|-2 - ||B|||. A version of the 
matrix pythagorean theorem is: if A^B is the all-zeros matrix, then ||A-|-B||p = ||A||p-|- ||B|||. 
If V has orthonormal columns, then ||AV^||^ = ||A|h for any A. If P is a symmetric projection 
matrix (i.e., P = P^ and P^ = P) then, ||PA||^ < ||A|h for any A. 
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Singular Value Decomposition (SVD) and Moore-Penrose Pseudo-inverse. The SVD of A G 

]^mxn ]-ank(A) = /9 is 


A = ( Ufc ^p-k 

' -V- 


Sfc 0 \ / Vj \ 


with singular values <ti > ... <Tfc > Cfc+i > ... > Up > 0. Here, k < p. We will use Uj (A) to denote 
the i-th singular value of A when the matrix is not clear from the context. The matrices G 
I^mxfc Up_fc G contain the left singular vectors of A, and, similarly, the matrices 

Vfc G and Vp_fc G contain the right singular vectors of A. It is well-known that 

Afc = = AVfcV^ = UfcU^A minimizes || A — X||g over all matrices X G of rank 

af most k. Specifically, || A — A^lH = and || A — Afc||| = Yli^i=k+i [32]). A^ = 

G denotes the so-called Moore-Penrose pseudo-inverse of A G R™^"^ (here 

is the inverse of Xa). By the SVD of A and A^, for all i = 1,... ,p = rank(A) = rank(A^): 
(Ti(A^) = l/fTp_i+l(A). 


4.1 The best rank k matrix 11^ f.{A) within a subspace V 


Our analysis uses theory involving computing the best rank k approximation of a matrix A within 
a given column subspace V. Let A G let A: < n be an integer, and let V G with 

k < c < n. G R™^"^ is the best rank k approximation to A in the column span of V. 

Equivalently, we can write 

n^,,(A) = vxopt, 


where 


Xopt = argmin ||A-VX|||. 
xeR‘'’^’":rank(x)<fc 


In order to compute IIy fc(A) given A, V, and k, one can use the following algorithm: 


1: V = is a gr decomposition of V with Y G and 4^ G R'^^'^. This step requires O(mc^) 
arithmetic operations. 

2: 3 = Y^A G This step requires 0{mnc) arithmetic operations. 

3: 3fc = AXfcVj; G is a rank k SVD of 3 with A G G and V*. G This 

step requires O(nc^) arithmetic operations. 

4: Return YA A^Y^A G of rank at most k. 


Notice that YAA^Y^A G R™^"^ is a rank k matrix that lies in the column span of V. The next 
lemma is a simple corollary of Lemma 4.3 in [21]. 

Lemma 23. Given A G V G R™^"^ and an integer k, YAA^^Y"^ and QUfcXfcV]; satisfy: 

||A - YAATy^AIII < ||A - YAXfcV^lll = ||A - <,fc(A)||| 

The above algorithm requires 0{mnc + ncf ) arithmetic operations to construct Y, 4^, and A. We will 
denote the above procedure as [Y, 4^, A] = BestColumnSubspaceSVD{A, V, k). 

Proof. The equality was proven in Lemma 4.3 in [21]. To prove the inequality, notice that for any 
matrix X : || A — YA(YA)^A||| < || A — YAX|||. Also, (YA)^ = A^Y^ = A^^Y"^, because both 
matrices are orthonormal. ■ 
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Next, we state a version of the above lemma for the transpose of A, equivalently for the best 
rank k approximation within the row space of a given subspace R. Let A G let /c < n be an 

integer, and let R G with k < c < m. 11^ ^(A) G R™^"^ is the best rank k approximation to A 
in the row span of R. Equivalently, we can write 

~ ^optR) 


where 


Xopf = argmin ||A-XR|||. 
xgKmxc.rank{x)<A: 


In order to compute 11^ A) given A, R, and k, one can use the following algorithm: 


1: R^ = YZ is a qr decomposition of R^ with Y G and Z G This step requires O(nc^) 
arithmetic operations. 

2: 3 = AY G This step requires 0{mnc) arithmetic operations. 

3: 3fc = G is a rank k SVD of 3 with Ufc G Sfc G and A G 

This step requires 0{mc^) arithmetic operations. 

4: Return AY A Y^ G of rank at most k. 


Notice that AY A A^ Y^ G is a rank k matrix that lies in the row span of R. The next lemma 

is a corollary of Lemma 23 when applied with A := A^ and V := R^. 

Lemma 24. Given A G R G R^^"" and an integer k, YAA^^Y"^ and UfcSfeA"*’ satisfy: 

||A - AYAATy^III < ||A - UfeEfeA'^RllI = ||A - n^_fc(A)|||. 

The above algorithm requires 0{mnc + mcf) arithmetic operations to construct Y, Z, and A. We will 
denote the above procedure as [Y, Z, A] = BestRowSubspaceSVD{A, R, k). 


5 Distributed PCA in the arbitrary partition model 

This section describes a fast distributed PCA algorithm with total communication 0{msk) words 
plus low order terms, which is optimal in the arbitrary partition model in the sense that an 
Q{msk) bits lower bound was given by [37]. The algorithm employs, in a novel way, the notion of 
projection-cost preserving sketches from [24]. In particular, whereas all previous [48, 23] dimension- 
reduction-based SVD methods reduce one dimension of the input matrix to compute some ap¬ 
proximation to the SVD, our method reduces both dimensions and computes an approximation to 
the SVD from a small almost square matrix. Unlike [37] which reduces only one dimension in the 
first communication round, we do the reduction on both dimensions in the same round. 

We first present the batch version of the algorithm which offers a new low-rank matrix approx¬ 
imation technique; a specific implementation of this algorithm offers a communication-optimal 
distributed PCA algorithm (we also discuss in Section 7 a variant of this algorithm that offers 
a two-pass space-optimal PCA method in the turnstile streaming model). Before presenting all 
these new algorithms in detail, we present the relevant results from the previous literature that 
we employ in the analysis. 
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5.1 Projection-cost preserving sketching matrices 

In this section, we recap a notion of sketching matrices which we call "projection-cost preserving 
sketching matrices". A sketching matrix from this family is a linear matrix transformation and it 
has the property that for all projections it preserves, up to some error, the difference between the 
matrix in hand and its projection in Frobenius norm. 

Definition 25 (Projection-cost preserving sketching matrices). We say that W G Iq an {s, k)- 
proiection-cost presewino; sketching matrix of A G if for all rank-k orthogonal projection matrices 

P ^ n satisfies 

(1-e)||A-PA||| < ||AW-PAW||| + c< (1 A e)|| A - PA||| 

where c is a non-negative constant which only depends on A and W. We also call AW an (e, k)-proiection- 
cost preserving sketch of A. 

Due to the following lemma, we know that a good rank-Zc approximation projection matrix of 
(e, /c)-projection-cost preserving sketch AW also provides a good rank-A: approximation to A. 

Lemma 26 (PCA via Projection-Cost Preserving Sketches - Lemma 3 in [24]). Suppose W G is 
an [e, kYprojection-cost preserving sketching matrix of A G R™^”. Let P* = argmin,.,j,^;i.(P)<^ ||AW — 
PAW|||. For all P,e’satisfying rankfP) <k,e’ > 0, if || AW-PAW||| < (l+e')|| AW-P*AW|||, 

||A — PA||| < -—- • (1 + e^)||A — Afclll 

[24] also provides several ways to construct projection-cost preserving sketching matrices. Be¬ 
cause we mainly consider the communication, we just choose one which can reduce the dimension 
as much as possible. Furthermore, it is also an oblivious projection-cost preserving sketching ma¬ 
trix. 

Lemma 27 (Dense Johnson-Lindenstrauss matrix - part of Theorem 12 in [24]). For e < 1, suppose 
each entry ofW G R”^^ is chosen 0{log{k))-wise independently and uniformly in where 

f = 0{ke~Y [21], For any A G with probability at least 0.99, W is an {e,k)-projection-cost 

preserving sketching matrix of A. 

5.2 A batch algorithm for the fast low rank approximation of matrices 

In this section, we describe a new method for quickly computing a low-rank approximation to 
a given matrix. This method does not offer any specific advantages over previous such tech¬ 
niques [48,23,18]; however, this new algorithm can be implemented efficiently in the distributed 
setting (see Section 5.3) and in the streaming model of computation (see Section 7); in fact we 
are able to obtain communication-optimal and space-optimal results, respectively. For complete¬ 
ness as well as ease of presentation, we first present and analyze the simple batch version of the 
algorithm. The algorithm uses the dense Johnson-Lindenstrauss matrix of Lemma 27 in order 
to reduce both dimensions of A, before computing some sort of SVD to a poly(/c/e) x poly (/c/e) 
matrix (see Step 2 in the algorithm below). 

Consider the usual inputs: a matrix A G a rank parameter k < rank(A), and an accu¬ 

racy parameter 0 < e < 1. The algorithm below returns an orthonormal matrix U G R"^^^ such 
that 

||A - UU^AIll < (1 + e) • ||A - Afclll. 


Algorithm 
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1. Construct two dense Johnson-Lindenstrauss matrices S G T G with = 0{ke ^), ^2 = 

0{k£~'^) (see Lemma 27). 

2. Construct A = SAT. 

3. Compute the SVD of Afc = (U^, G G G 

4. Construct X = ATV^^ 

5. Compute an orthonormal basis U G for spanpC) (notice that rankCX.) < k). 

Theorem 29 later in this section analyzes the approximation error and the running time of the 
previous algorithm. First, we prove the accuracy of the algorithm. 

Lemma 28. The matrix U G with k orthonormal columns satisfies with probability at least 0.98; 

IIA - UU^AIll <(!+£)• IIA - Afclll. 

Proof. 

A-Afc 1= A-AVx 1= SAT-SATVx 1= T^A^S^-V. T^A^S^ | 

II Klir II Afc AfcllJ^ II Afc Afclli* II Afc Afc Hi" 

The first equality follows by the SVD of A^ = The second equality is by the 

construction of A = SAT. The third equality is due to VM, ||M||| = ||M^|||. 

Due to Lemma 27, with probability at least 0.99, is an (e, /c)-projection-cost preserving 
sketch matrix of T^A^. According to Lemma 26, 

||AT - ATV^^Vjjll = IItV - ^tVIII < ^ . ||AT - (AT)fc||| (2) 

Observe that 

IIUU'^AT - AT||| = IIXX^AT - AT||| 

< ||XV^ -AT||| 

■^k 

= IIATVi -AT||| 

" Afc Afc IIF 

< ^.||AT-(AT)fc||2 

The first equality uses the fact that U is an orthonormal basis of span(X). The first inequality is 
followed byVX,M,N, IIXX^M - M||| < ||XN — M|||. The second equality uses the construction 
that X = ATV^^. The second inequality follows by Eqn (2). 

Due to Lemma 27, with probability at least 0.99, T is an (e, /c)-projection-cost preserving sketch 
matrix of A. Due to Lemma 26, 

||UUTA-A|||< ji±^.||A-Afc||| 

(1 -e) 

Due to union bound, the probability that is an (e, /c)-projection-cost preserving sketch matrix 
of A^ and T is an (e, A:)-projection-cost preserving sketch matrix of A is at least 0.98. Note that 
is 1 + 0(e) when e is small enough, so we can adjust e here by a constant factor to show the 
statement. ■ 

Next, we present the main theorem. 
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Theorem 29. The matrix U G with k orthonormal columns satisfies with probability at least 0.98; 

IIA - UU^AIll < (1 + e) • IIA - Afclll. 

The running time of the algorithm is 

O {nmke~‘^ + mk^e~^ + poly(te“^)) . 

Proof. The correctness is shown by Lemma 28. 

Running time. Next, we analyze the running time of the algorithm: 

1. There are a total of (^i x m + ^2 x n) entries of S and T. It is enough to generate them in 
0{{n + m)k£~^) operations. 

2. We first compute AT with 0(mn^2) arithmetic operations. Then, we compute SAT with 
0 {^im^ 2 ) arithmetic operations. 

3. This step requires 0(poly(A:/e)) operations since we compute the SVD of a 0(poly(A:/e)) x 
0(poly(/c/e)) matrix A. 

4. We already have AT from the second step. Hence, 0{m^2k) additional arithmetic operations 
suffice to compute ATV 

5. 0{mk‘^) operations suffice to compute an orthonormal basis for X, e.g., with a QR factoriza¬ 
tion. 

■ 

5.3 The distributed PCA algorithm 

Recall that the input matrix A G partitioned arbitrarily as: A = i = I : s, 

Ai G The idea in the algorithm below is to implement the algorithm in Section 5.2 in the 

distributed setting. 

Input: 

1. A G arbitrarily partitioned A = A^ for i = 1 : s, A^ G 

2. rank parameter k < rank(A) 

3. accuracy parameter e > 0 

Algorithm 

1. Machines agree upon two dense Johnson-Lindenstrauss matrices S G T G with = 

0{k£~'^),^2 = 0{k£~‘^) (see Lemma 27). 

2. Each machine locally computes Ai = SA^T and sends Ai to the server. Server constructs A = Ai. 

3. Server computes the SVD of A^ = G G G 

4. Server sends Vi to all machines. 

Afc 

5. Each machine construct = A^TV i and sends to the server. Server constructs X = N - X^. 

6. Server computes an orthonormal basis U G R™^^ for span(X) (notice that ranfc(X) < k). 

7. Server sends U to each machine. 

Notice that in the first step, S and T can be described using a random seed that is 0(log(A:))- 
wise independent due to Lemma 27. 
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5.3.1 Main result 

The theorem below analyzes the approximation error, the communication complexity, and the 
running time of the previous algorithm. Notice that the communication cost of this algorithm 
is only given in terms of "real numbers". The only step where we can not bound the length of 
a machine word is when the server communicates V; to all machines; and this is because the 

■^k 

entries of Vcould be unbounded (see the discussion regarding the upper bounds in Section 1.2). 
We resolve this issue in the following section. 

Theorem 30. The matrix U G u)ith k orthonormal columns satisfies w.p. 0.98; 

IIA-UU^AIll < (l+e). ||A-Afclll. (3) 

The communication cost of the algorithm is 

0{msk + s ■ poly(/c/e)) 

"real numbers" and the running time is of the order 

O {nmke~‘^ + mk^e~^ + poly(te“^)) . 

Proof. The matrix U - up to the randomness in the algorithm - is exactly the same matrix as in the 
batch algorithm in Section 5.2, hence Theorem 29 proves Eqn. 3. 

The algorithm communicates 0{msk + s ■ poly(A:/£)) real numbers in total: 0{s ■ poly(/c/e)) in 
steps 2 and 4, and 0{smk) in steps 5 and 7. 

The operations in the algorithm are effectively the same operations as in the batch algorithm 
in Section 5.2, hence the analysis of fhe running time in Theorem 29 shows the claim. ■ 

6 Obtaining bit complexity for the distributed PCA algorithm 

For the algorithm in the previous section, we were only able to provide a communication upper 
bound in terms of "real numbers". In this section, we describe how to obtain a communication 
upper bound in terms of words for the above protocol, where each word is 0{\og{mnsk/e)) bits. 

The basic idea is that we have a case analysis depending on the rank of the matrix A. If the 
rank of A is less than or equal to 2k, we follow one distributed protocol and if the rank is at 
least 2k we follow a different protocol. In Section 6.1, Section 6.2, and Section 6.3 we describe 
the algorithm that tests the rank of a distributed matrix, and the two PCA protocols, respectively. 
Then, in Section 6.4 we give the details of the overall algorithm and in Section 6.5 we give its 
analysis. 

6.1 Testing the rank of a distributed matrix 

Lemma 31. Given A G ^ parameter k < rank(A), there exists a distributed protocol 

in the arbitrary partition model to test if the rank of A is less than or equal to 2k using 0{sk‘^) words of 
communication and succeeding with probability 1 — 6 for an arbitrarily small constant 5 > 0. 

Proof This is an immediate implementation of a streaming algorithm due to [21] for testing if an 
n X n matrix A has rank at least 2k in the streaming model, using 0{k‘^) words of space. In that 
algorithm, there is a fixed 6nk/d x n matrix H whose entries are integers of magnitude at most 
poly(n), where (5 > 0 is an arbitrarily small constant. The algorithm simply chooses Ak random 
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rows from H. Letting H' be the 2k x n matrix of the first 2k random rows, and H'' be the n x 2k 
matrix whose columns are the next 2k randomly chosen rows, the algorithm just declares that A 
has rank at least k iff the rank of H'AH" is 2k. 

The above streaming algorithm can be implemented in the distributed setting by having the 
coordinator choose 4/c random rows of the fixed, known matrix, and send the row identities to 
each of the machines. This only takes 0{sk) words of communication. Then machine i computes 
H'AjH", and returns this to the coordinator. The coordinator can then add these up to compute 
H'AH" and compute its rank. The total communication is 0{sk^) words and the protocol succeeds 
with probability at least 1 — 6 for an arbifrarily small constant 5 > 0. Note that we can assume our 
input matrix A, which is m x n, is a square n x n matrix by padding with all-zeros rows. Those 
rows of course, will never get communicated in the implementation described above. ■ 

6.2 Distributed PCA protocol when rank(A) < 2k 

6.2.1 Subsampled Randomized Hadamard Transform and Affine Embeddings 

Our algorithms use the following tool, known as "Subsampled Randomized Hadamard Trans¬ 
form" or SRHT for short, to implement efficiently fast dimension reduction in large matrices. 

Definition 32 (Normalized Walsh-Hadamard Matrix). Fix an integer m = 2P,for p = 1,2, 3,.... The 
(non-normalized) m x m matrix of the Walsh-Hadamard transform is defined recursively as, 


Hn = 

Hm/2 

Hm/2 

, with 

H 2 = 

■ +1 +1 ■ 


Hm/2 

Hm/2 



. +1 -1 . 


The mx m normalized matrix of the Walsh-Hadamard transform is equal to H = m 2 G 

Definition 33 (Subsampled Randomized Hadamard Transform (SRHT) matrix). Fix integers ^ and 
m = 2P with ^ < m and p = 1,2, 3,.... An SRHT matrix is an ^ x m matrix of the form 



• D G jg d random diagonal matrix whose entries are independent random signs, i.e. random 

variables uniformly distributed on {±1}. 

• H G is a normalized Walsh-Hadamard matrix (see Definition 32). 

• R G is a subset or r rows from thenxn identity matrix, where the rows are chosen uniformly 
at random and without replacement. 

The next lemma argues that an SRHT matrix is a so-called "affine embedding matrix". The 
SRHT is one of the possible choices of Lemma 32 in [23] that will satisfy the lemma. 

Lemma 34 (Affine embeddings - Theorem 39 in [23]). Suppose G and H are matrices with m rows, 
and G has rank at most r. Suppose T is a ^ x m SRHT matrix (see Definition 33) with ^ = 0(rje^). 
Then, with probability 0.99, for all X simultaneously: 

(1 - e) • ||GX - H||| < ||T(GX - H)||| < (1 + e) • ||GX - H|||. 

Finally, we note that matrix-vector multiplications with SRHT's are fast. 

Lemma 35 (Fast Matrix-Vector Multiplication, Theorem 2.1 in [6]). Given x G R™ and ^ < m, one 
can construct T G R^^™' and compute Tx in at most 2mlog2(^ -I-1)) operations. 
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6.2.2 Generalized rank-constrained matrix approximations 

Let M G N G L G and /c < c, r be an integer. Consider the following optimiza¬ 

tion problem, 

Xopt G argmin ||M — NXL||p. 

xeK'=><’',rank(X)<fc 

Then, the solution Xopt G with rank(Xopt) < k that has the minimum ||Xopt||F out of all 

possible feasible solutions is given via the following formula, 

Xopt = Nt (uNUj,MVLVT)^Lt. 

(UNUjfMVLVL)^ G of rank at most A: denotes the best rank A: matrix to UnUnMVlVl G 
]^mxn 'pj.jjg result was proven in [30] (see also [49] for the spectral norm version of the problem). 

6.2.3 The PCA protocol 

Lemma 36. Suppose the rank p of A G R™^"- satisfies p < 2k, for some rank parameter k. Then, there is a 
protocol for the Distributed Principal Component Analysis Problem in the arbitrary partition model using 
0{smk -I- sk‘^/e‘^) words of communication and succeeding with probability 1 — 6 for an arbitrarily small 
constant <5 > 0. 

Proof. The nx2k matrix H" chosen in the protocol of Lemma 31 satisfies that with probability 1 — 6, 
for an arbitrarily small constant <5 > 0, the rank of AH'' is equal to the rank of A if p < 2k. Indeed, 
if this were not true, the algorithm could not be correct, as the rank of H'AH" is at most the rank 
of AH" (and the same algorithm can be used for any p < 2k). It follows that with probability 1 — 6, 
the column span of AH" is equal to the column span of A. Hence, as in the protocol of Lemma 
31, the coordinator learns the column space of A, which can be described with 0{km) words. The 
coordinator thus communicates this to all machines, using 0{skm) total words of communication, 
assuming the entries of A are integers of magnitude at most poly (mns/e). 

Let C = AH", which is m x 2k. We can set up the optimization problem: 

min IICXC^A- A||f. (4) 

rank(x)<fc 

Because the size of C is only m x 2k, every machine can know C by sending a total of 0{smk) 
words. Since the rank of C and A are small, we can sketch on the left and right using affine 
embeddings Tieft and Tright- Then machine i sends AiTright to the coordinator, together with 
TieftAiTnght. The coordinator computes Cf^AT^ig^t and Tif.ftATright by adding up the sketches, 
and sends these back to all the machines. Each machine can then solve the optimization problem 

niin W'TieftC'X.C^ AT right — Tie ft AT right\\¥, (5) 

rank(X)<A: 

each obtaining the same X* which is the optimal solution to Eqn (5). Due to Lemma 34, X* is a 
(1 -|- 0(e))-approximation to the best solution to Eqn (4). Einally, every machine outputs the same 
orthonormal basis U G R”^^^ for CX*. 

We can construct affine embedding matrices T^e/t G and Tright £ R"^^^ with = 

Ofkje^), ^2 = 0{k/£^) (see Definition 33). The total communication of this protocol is 0{skm -|- 
sk'^js'^) words and the success probability can be made 1 — 5 for an arbitrarily small constant d > 0. 
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6.3 Distributed PCA protocol when rank(A) > 2k 

The idea here is more involved than in the previous subsection and in order to describe the algo¬ 
rithm we need several intermediate results. 

6.3.1 Lower bounds on singular values of matrices with integers entries 

The first lemma gives a lower bound on the singular values of a matrix with integer entries with 
bounded magnitude. 

Lemma 37. (Lemma 4.1 of [21], restated) Ifanmxn matrix A has integer entries bounded in magnitude 
by 7 , and has rank p = rank( A), then the k-th largest singular value A satisfies 

(Tk > (mn72)-^/(2(/’-^)). 

Proof. In the proof of Lemma 4.1 of [21], equation (10), it is shown that if is the fc-th largest 
eigenvalue of A^ A, then 

\k > 

This implies the /c-th singular value of A satisfies ■ 

Next, we state two immediate corollaries of this lemma for future reference. 

Corollary 38. If an m x n matrix A has integer entries bounded in magnitude by poly(mns/e) and 
IIA — Afcllp > 0, then 

||A — A^IIf > {mns/ 

Proof. Since || A — A^Hf > 0, the rank p of A is at least A: -|-1, so by the preceding lemma, 

||A - Afcllp > CTk+i > {po\y{mns/, 


as desired. ■ 

Corollary 39. If an m x n matrix A has integer entries bounded in magnitude by poly(mns/e) and 
rank{A) > 2k, then 

||A — A^IIf > 1/poly(mns/e). 

Proof. This follows by plugging p = 2k into Lemma 37. ■ 

6.3.2 Lower bounds on singular values of integer-perturbed matrices 

In this section, we describe a perturbation technique for matrices and provide lower bounds for 
the smallest singular value of the perturbed matrix. We start with a theorem of Tao and Vu. 

Theorem 40. (Theorem 2.5 of [52 ]) Let M.be annxn matrix with integer entries bounded in magnitude by 
for a constant C > 0. Let N„ be a matrix with independent entries each chosen to be 1 with probability 
1/2, and —1 with probability 1/2. Then, there is a constant B > 0 depending on C,for which 

Pr[||(M + N,,)-i ||2 > n^] < 1/n. 

In words, the result indicates that the spectral norm of the inverse of the perturbed matrix is 
bounded from below with high probability. We now describe a simple corollary of this result. 


26 



Corollary 41. Let M.be ann x n matrix with integer entries bounded in magnitude by for a constant 
C > 0. Let N„ be a matrix with independent entries each chosen to be 1/n^ with probability 1/2, and 
— 1/n^ with probability 1/2, where D > 0 is a constant. Then, there is a constant B > 0 depending on C 
and D for which 

Pr[||(M + N,,)-i ||2 >n^] < 1/n. 

Proof. This follows by Theorem 40 after replacing the constat C in that theorem with C + D, and 
scaling by n^. ■ 

We need to generalize Corollary 41 to rectangular matrices since we will eventually apply this 
perturbation technique to the matrix A to which we would like to compute a distributed PCA. 

Lemma 42. Let M.be an m x n matrix with integer entries bounded in magnitude by for a constant 
C > 0, and suppose m < n. Let Nm,n ^ matrix with independent entries each chosen to be 1/n^ with 
probability 1/2 and —1/n^ with probability 1/2, where D > 0 is a constant. Then, there is a constant 
B > 0 depending on C and D for which 

Pr[(Tm(M + Nm,n) < 1/n^] < 1/n, 

where (Tm(M + Nm,n) denotes the smallest singular value o/M + Nm,n- 

Proof. Suppose we were to pad M with n — m zero rows, obtaining a square matrix M. Now 
consider the n x n matrix N„ „ with independent entries each chosen to be 1/with probability 
1/2 and —XjnP with probability 1/2. By Corollary 41, all singular values of M + N„ „ are at least 
XjnP. Now consider a unit vector x G which is zero on all but its top m coordinates. Let 
y G be the unit vector which agrees with x on its top m coordinates. Then, 

1/n^ < ||x(M + N,i,„)||2 

= ||y(M + N,„,,,)|| 2 , 

where the inequality uses the lower bound on the singular values of M + N„ which occurs with 
probability at least 1 — 1/n, and the equality follows by definition of the matrices and vectors we 
have defined. As y G can be chosen to be an arbitrary unit vector, it follows that + 

Nm,n) > 1/n^, which completes the proof. ■ 

6.4 Description of algorithm 

Using the above results, we are now ready to describe a distributed PCA algorithm whose commu¬ 
nication cost can be bounded in terms of machine words and not just in terms of "real numbers". 
As in the algorithm in Section 5.3, we denote with Bj G the matrix that Bi arises from Ai 

after applying the Bernoulli perturbation technique discussed above in Lemma 42 and Vi > 1, Bj 
is equal to Aj. Using this notation, we have B := ^ Notice that B exactly arises from 

A after applying the such Bernoulli perturbation technique. 

Input: 

1. A G arbitrarily partitioned A = A^ for i = 1 : s, A^ G 

2. rank parameter k < rank(A) 

3. accuracy parameter e > 0 

Algorithm 
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1. Use the protocol of Lemma 31 with S = 0.01 to test if the rank of A is less than 2k. 

2. If rank(A) < 2k, use the protocol of Lemma 36 to find some orthonormal U G 

3. If rank(A) > 2k, 

(a) machine 1 locally and independently adds 1 /n^ with probability 1 / 2 , and —Xj'nP with prob¬ 

ability 1/2, to each of the entries of Ai, where D is the constant of Lemma 42. Note that this 
effectively adds the matrix of Lemma 42 to the entire matrix A. For notational conve¬ 
nience let B = A + Nm,n/ Bi G ^0 t]-i 0 local perturbed matrix of machine 1 and Vi > 1, 

Bi is just equal to A^. 

(b) Machines agree upon two dense Johnson-Lindenstrauss matrices S G T G with 

= 0{ke~'^),^2 = 0{ke~'^) (see Lemma 27). 

(c) Each machine locally computes B^ = SB^T and sends B^ to the server. Server constructs B = 

(d) Server computes the SVD of Bfe = G G G 

(e) Now server rounds each of the entries in to the nearest integer multiple of 1/n'’' for a 
sufficiently large constant 7 > 0. Let the matrix after the rounding be 

(f) Server sends to all machines. 

(g) Each machine construct = B^TV^ and sends to the server. Server constructs X = 

(h) Server computes an orthonormal basis U G R""^^ for span(X) (notice that rankpC) < k), e.g. 
with a QR factorization. 

(i) Server sends U to each machine. 

6.5 Main result 

The theorem below analyzes the approximation error, the communication complexity, and the 
running time of the previous algorithm. Notice that the communication cost of this algorithm is 
given in terms of machine words. 

Theorem 43. The matrix U G with k orthonormal columns satisfies with arbitrarily large constant 

probability: 

IIA-UU^AIll < (l+£)-||A-Afclll. (6) 

The communication cost of the algorithm is 0{msk + s ■ poly(/c/e)) words and the running time is 
O {nmke~‘^ + mk‘^e~'^ -F poly(A:e“^)) . 

6.6 Proof of Theorem 43 

Proof of Eqn. 6 If rank(A) < 2k, then Lemma 36 shows the claim. If rank(A) > 2k, then the 
situation is more involved and we prove the approximation bound in the following subsection. 

Communication Complexity Step 1 requires 0{sk‘^) words (Lemma 31). Step 2 requires 0{smk+ 
slff words (Lemma 36). Step 3 requires 0{msk + s ■ poly(A:/e)) words and this follows from the 
analysis in Theorem 30. The only difference with Theorem 30 is that now all matrices communi¬ 
cated in the algorithm have real numbers which can be represented efficiently with one machine 
word of at most 0{log{mns/e)) bits. To see this, note that each entry of is bounded in mag¬ 
nitude by . Therefore, the entries of Vg^ and fhe entries of X* = BiTVg^ can each be 

described using 0(log n) bits, i.e., a constant number of machine words. 
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Running time The operations in the third step in algorithm are effectively the same operations 
as in the batch algorithm in Section 5.2 (plus some extra operations whose running time is not 
asymptotically larger), hence the analysis of the running time in Theorem 29 shows the claim (the 
number of operations in the first two steps of the algorithm is asymptotically less than the number 
of operations in the third step). 


6.7 Proof of Eqn. 6 if rank(A) > 2k 

Lemma 44 (Result in [21]). Suppose M G For e > 0, let P G be a matrix of which 

entries are 0{q)-wise independently and uniformly chosen from {—l/v^, where ^ = 0{qle^). 

With arbitrarily large constant probability, P is an e-subspace embedding matrix of column space of M. 
Specifically, Vx, it has 

(1 -e)||Mx ||2 < ||PMx ||2 < (1 + £)||Mx ||2 

Lemma 45 (Lemma C.2 in [40]). Suppose P is an e-subspace embedding matrix of column space of 
M G (q < p). With arbitrarily large constant probability, VI < f < g 

(1 - e)n*(PM) < ni(M) < (1 + e)ni(PM) 

where ai means the singular value. 

It is suffice to show the following: 

Lemma 46. With arbitrarily large constant probability, ||(BTVg ^)^||2 < 8n^ 

Proof. Notice that BTVg^ is an m x /c matrix. With high probability, S is a e-subspace embedding 
matrix of column space of BTVg^ due to Lemma 44. According to Lemma 45, (Tmin(BTVg^) > 
(1 — e)(Tmin(SBTVg^) where amin means the minimum singular value. Without loss of gener¬ 
ality, we assume e < 1/2. Then iTmm(BTVg^) > ^crmm(SBTVg^). Because are the top 
k right singular vectors of SBT, iTmm(SBTVg^) > crmm(SBT). Applying Lemma 44 again, 
is a constant error subspace embedding matrix of column space of B^S^. Combining with 
Lemma 45, we can make cJmm(SBT) > icJmi„(SB). Notice that Lemma 44 also implies that 

is a constant error subspace embedding matrix of which means Vx G M^L we can 
make ||xy^S ||2 > 5 l|x|| 2 . Thus, Vx G ||xy^SB ||2 > ^cJmm(B)||x|| 2 . Due to Lemma 42, 
aminiB) > X/nP which implies (Tmm(SB) > i • yJ~^(Tmin{B) > To conclude 

||(BTVgjt ||2 < < 8n®. 


Let E = Vs — Vs , we have 


A-UU^AIIf = 
< 

< 

< 

< 

< 


||B-N,„,,,-UUT(B-N„,,,,)||f 

||B-UU'^B||f + ||N,„,,,||f 

||B - (BTVgJ(BTVgJ^B||F + ||N,„,,,||f 

||B - (BT(Vg^ + E))(BTVgjtB||p + ||N„,,,,||f 

||B - (BTVgJ(BTVgJ+B||F + ||BTE(BTVg+ ||N,„,,,||f 

(1 + 0(£))||B - BfcllF + ||BTE(BTVgjtB||p 

(l + 0(£))||B-Bfc||F 
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The first equality follows by using the relation B = A + Nm,n- The first inequality uses the tri¬ 
angle inequality for the Frobenius norm and the fact that I — UU^ is a projector matrix and can 
be dropped without increasing the Frobenius norm. The second equality uses the fact that both 

UU^ and (BTVg^)(BTVg^)^ are the same projector matrices. The second inequality follows by 
VM, C, X, ||M — CC^M||f < ||M — CX||f and the relation E = — Vg^. The third inequality 

uses the triangle inequality. The fourth inequality uses that ||B — B^Hf > l/poly(mns/e) (follows 
from Lemma 42) while ||Nm,n||F can be made 1/n* for an arbitrarily large integer t, and the fact 
that ||B — (BTVg^)(BTVg^)^B||F < (1 + 0(e))||B — B^Hf (implied by replacing all A with B in 
Lemma28). The lastinequality uses the fact ||BTE(BTVg^)^B||F < ||B||F||T||F||E||F||(BTVg^)^||F||B||F 
where ||B||f,||T||f, ||(BTVg^)^||F are poly (nms/e) (Lemma 46 implies the bound of ||(BTVg^)^||F) 
and we can make |E|f be l/n^ for an arbitrarily large p, and ||B — B^Hf > l/poly(mns/e). 

Overall, after rescaling e, we have 

IIA-UU^AIIf < (1 + £)||B-BfcllF. (7) 

Finally, we need to relate ||B — B^Hf to ||A — A^Hf, which we do in the following lemma. 

Lemma 47. Let A G any matrix, Nm,n the random matrix of Lemma 42, and B = A -|- Nm,n- 

Let k < rank(A), e > 0. Then, for arbitrarily large constant probability, 

||B - B^-IIf < (1 -I- e)|| A — A^-IIf. 

Proof Since B = A -|- the proof idea is to relate the singular values of B to the singular 

values of A using WeyTs inequality. Specifically, for i : 1 : m (recall we assume m < n) we have 

|(Ti(B) - Ui(A)| < ||N„,,,i||2 < 1/n^. 

Rearranging terms in this inequality, and taking squares of the resulting relation we obtain (for 

i = 1 : m), 

o-,^(B) < (fTi(A) -6 l/n^f . (8) 

We manipulate the term ||B — B^Hf as follows: 



^ i=k+l 

< ||A — A^IIf -I- e • II A — A^IIf -I- e • || A — A^Hf 

< (1-|-2e)||A — A/jIIf- 
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The first inequality uses Eqn. (8). In the third inequality, we used 


< e||A - Afcllp, 

which follows from Corollary 39 for a sufficiently large constant D. Also, we used 

Y^2cjj(A)/n^ < Y^2cJi(A)/n'^ < Y^poly(nms/e)/n'^ < e||A - A^Hf, 

where the second inequality follows because ||A ||2 < poly(nms/e) and the last inequality uses 
again Corollary 39 for a sufficiently large constant D. ■ 

Completing the proof. Using Lemma 47 in Eqn (7), taking squares in the resulting inequality, 
and rescaling e concludes the proof. 

7 Streaming Principal Component Analysis 

In this section, we are interested in computing a PCA of a matrix in turnstile streaming model. 
Specifically, there is a stream of update operations that the operation has form {iq,jq,Xq) which 
indicates that the entry of A G should be incremented by Xg where fg G {l,...,m},jg G 

{1,..., n}, Xg G M. Initially, A is zero. In the streaming setting of computation, we are allowed only 
one pass over the update operations, i.e., the algorithm "sees" each update operation one by one 
and only once. Upon termination, the algorithm should return a matrix U with k orthonormal 
columns which is a "good" basis for span{A) . In Section 7.1.1 below, we describe an algorithm 
which gives a space-optimal streaming algorithm. Eurther more, we provide a variation of this 
algorithm in Section 7.1.2 which can output A^ satisfying || A — A^llp < (1 -|- e) • ||A — A^llp. It 
meets the space lower bound shown in [21]. In Section 7.2, we relax the problem and we describe 
a two-pass streaming algorithm which is a version of the algorithm of Section 5.2; unfortunately, 
the space complexity of this algorithm can be bounded only in terms of "real numbers". We fix 
this in Section 7.3 where we describe an optimal two-pass streaming algorithm. 

Inputs to the algorithms in Section 7.1, Section 7.2 and Section 7.3 are a stream of update 
operations (ii, ji,xi), {i 2 ,j 2 ,X 2 ), ■■■, {ii,ji,xi), a rank parameter k < rank(A), and an accuracy 
parameter 0 < e < 1. 

7.1 One-pass streaming PCA 

7.1.1 The algorithm which outputs U 

In the following algorithm, the output should be U with orthogonal unit columns and satisfying 

||A - UU'^AIll < (1 -h e) • IIA - Afclll 
Our algorithm uses the following property of random sign matrices: 

Lemma 48 (Sketch for regression - Theorem 3.1 in [21]). Suppose both of A and B have m rows and 
rank{A) < k. Further more, if each entry ofS G is 0{k)-wise independently chosen from {—1, -1-1} 
where ^ = 0{k/e) and 

X = arg min ||SAX-SB||| 

X 

, with probability at least 0.99, 

||AX - Bill < (1 e) • min ||AX - B||| 
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Algorithm 

1. Construct random sign sketching matrices S € with = 0(ks~^) and R € with 

^2 = 0(ke~^) (see Lemma 48) 

2. Construct affine embedding matrices Tteft G R^^^™ and Tright G R"^^^ with ^3 = 0(k£~^), ^4 = 
0{k/£~^) (see Definition 33). 

3. Initialize all-zeros matrices: M G L G N G D G We will maintain 

M, L, N, D such that M = Tie/tATright, L = SAT^g/ii, N = Tje/tAR and D = AR. 

4. For {iq,jq, Xq) := (ii, ji, xi),(one pass over the stream of update operations) 

(a) For all t — j — 1, ^ 4 , let IVIj J — IVI^J -t- (T/e/i)z,ig * Xq * (J^right)jg,j- 

(h) For all i — j — 1,..., ^ 4 , let ^ 4“ ‘ right)jg,j • 

(c) For all t = 1 ,...,^ 3 , j = 1,^ 2 ? l^t -t- * Xq • Rj^ j. 

(d) For all j = 1,..., $ 2 , let Di^j = + Xq ■ Rj^,j 

5. end 

6 . ConstructX, = argmin ||N-X-L—M|||. (NoticethatX* = argmin ||T;e/t (ARXSA — A) Tright||p.) 

rank(x)<fe rank(x)<fc 

7. Compute the SVD of X* = Ux.Sx.V];;^ (Ux. G R^''^ Sx. G Vx. G then, compute 

T = DUx.. 

8 . Compute an orthonormal basis U G R™^^ for span{T). 

Theorem 52 later in this section analyzes the approximation error, the space complexity, and 
the rurming time of the previous algorithm. First, we prove a few intermediate results. 

Lemma 49. For all matrices X G probability at least 0.98; 

(1 - e)2||ARXSA - A||| < ||T/e/t (ARXSA - A) TrightUp < (1 + e)^||ARXSA - A||| 

Proof. Notice that ranA:(AR) < ^2 = 0{ke~^). Then from Lemma 34 (G := AR and H := A), 
with probability at least 0.99 for all Y G 

(1 - £)||ARY - A||2 < ||Tie/t(ARY - A)||2 < (1 + £)||ARY - A||| 

Replacing Y = XSA, for an arbifrary X G obtain that with probability at least 0.99 and 

for all X G simultaneously: 

(1-e)|| ARXSA-A||| < \\Tieft (ARXSA-A)||| < (1 + e)|| ARXSA - A||| 

Now notice that rank{SA) < = 0{k£~^). Then from Lemma 34 (G := A^S^ and H := 

A^T^^jj), with probability at least 0.99 for all Z G 

(1 - £)||ATsTzT - A^Tl^.Wl < ||T,Vi(A^STzT - < (1 + e)||ATsTzT - A^tI^,\\1 

Replacing Z = X^R^A^Tj^j^ for an arbitrary X G R^^x^i^ obtain that with probability at least 
0.99 and for all X simultaneously: 

(l-e)2||(ARXSA-A)T||| < \\{Tieft (ARXSA - A) Tright)'^fp < (1 + £)2||(ARXSA - A)T||2 
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Lemma 50. Let^opt = ^'^g™™rank(x)<fc l|ARXSA —A||| zfzf/i g Then, with probability 

at least 0.98 

min \\Ti,ft (ARXSA - A) T^ghtlll < ||ARX,SA - A||| < ii±£^|| ARX^^^SA - A||| 
rank(X)<fc (1 ~ 

Proof. From Lemma 49, we have that for all matrices X G with probability at least 0.98: 

(1 - ef ■ II ARXSA - A||| < ||T;e/t (ARXSA - A) Trightllp < (1 + ' || ARXSA - A||| 

Applying this for X := Xopt, we obtain: 

(1 — e)^ • II ARXopjSA — A||p < ||T;e/t (ARXop^SA — A) T,,jg/,i||p < (1+ e)^ • HARX^p^SA — A||p 
Applying this for X := X*, we obtain: 

(/) 

(1 - sf • II ARX.SA - A||| < \\Tieft (ARX.SA - A) TrightWl < (1 + ' || ARX^p^SA - A||| 

Combining (/), (e) along with the optimality of X*, formally using the relation: 

||T;e/t (ARX*SA - A) Trightllp < W^left (ARXoptSA - A) Trightllp, 


shows the claim. 


In words, the lemma indicates that in order to (1 + e)-approximate the optimization problem 
II ARXSA — A|||, it suffices fo "sketch" the matrix ARXSA — A from left and right 

with the matrices T/e/t and Tright- Recall that X* = argmin ||T;e/t (ARXSA — A) TrightW^- 

rank(x)<fc 

Lemma 51. Let IKopt = argminj.^j^p^^^^p || ARXSA — A||p. Then, with probability at least 0.98 

IIARXoptSA — A||p < (1 + e) • || A — A^Hp, 

Proof. Suppose the SVD of A = UAXAV^and A^ = Uaj.Xaj,V^^. Consider about the following 
regression problem: 

min ||Uai.X — A||| 

rank(^)<k 

Since we can choose X to be A, we have 

min IlUAfcX - A||| < ||A - Afclll 

rank{'X.)<k 

Let X = arg mmrank(x)<k IISUa^X — SA|||. According to Lemma 48, with probability 0.99 
||Ua,X - A||2 < (1 + e) min ||Ua,X - A||2 < (1 + £)|| A - A^Wl 

rank[X.)<k 


Since X minimizes ||SUAfcX — SA|||, the row space of X is in the row space of SA (Otherwise, 
we can project SUa^X into the row space of SA to get a better solution). We denote X = WSA 
where rankifN) < k. Now, consider about another regression problem: 


min 

rank{'X.)<k 


||(WSA)TxT-A^ lll 
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If we choose X to be UAt/ we have 


min ||(WSA)TxT - < ||Ua,X - A||| < (1 + e)||A - Afc||| 

rank{X.)<k 

Let X = argmin^„„fc(x)<fc ||R^(WSA)Tx'^ - R'^A^|||. According to Lemma 48, with probability 
0.99 

||R^(WSA)Tx^-RTaT||| < (1 + e) min ||(WSA)TxT - A^lH < (1 + e)2||A - Afc||| 

rank(X.)<k 

Since X minimizes ||XWSAR — AR|||, the column space of X is in the column space of AR. We 
denote X = ARG where rank{G) < k. Thus, we have 

II ARXoptSA - A||| < II ARGWSA - A||| < (1 + £)^|| A - Afc||| 

We scale the s with a constant factor. The statement is shown by applying union bound. ■ 

Theorem 52. The matrix U G ivith k orthonormal columns satisfies w.p. 0.96; 

||A - UU^AIll < (1 + O (e)) • IIA - Afc|||. 

The space complexity of the algorithm is O {mkfe + poly(/ce“^)) words, and the running time for each 
update operation is 0{poly{ke~^)) and the total running time is of the order O {l ■ poly(/ce“^) + mk‘^e~^) 
where I is total number of updates. 

Proof Because UU^ projects the columns of A into the column space of T/e/jARX*, 

IIA - UU^AIll < ||Tte/i (ARXSA - A) TrightWl 
According to Lemma 50 and Lemma 51, 

||A - UU^AIll < (1 + O (e)) • ||A - Afc||| 

To see the space complexity of the algorithm, observe that it suffices to maintain the matrices 
in the fourth step of the algorithm. Furthermore, observe that by the end of the stream: M = 
TieftATright G L = SATright G N = TieftAR G D = AR G 

Those matrices form the so called "sketch" of the algorithm. 

Since all of Ci; C 2 , ^ 3 , ^4 are 0{ke~^), the running time for each update operation in the fourth 
step is 0(poly(/ce“^)). We can do the sixth step by using the result of Section 6.2.2. The com¬ 
putation takes poly(/ce“^) running time. In the seventh step, computing SVD needs poly(fce“^), 
and computing T needs 0{mk‘^e~^). We can use 0{mk‘^) to compute U in the final step, e.g. QR 
decomposition. ■ 

7.1.2 A variation which outputs A^ 

We just slightly modify the previous algorithm in Section 7.1.1 to get the following one which can 
output a matrix A^ G M™-xn satisfying 

— (1+^) ■ l|4^“Afc|lF 


Algorithm 
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1. Construct random sign sketching matrices S G with = 0{ke and R G with 

^2 = 0{ke~^) (see Lemma 48) 

2. Construct affine embedding matrices Tje/i G and Tright G with ^3 = 0{ks~^), ^4 = 

0{k/e~^) (see Definition 33). 

3. Initialize all-zeros matrices: M G L G N G J 3 g and C G We 

will maintain M,L,N, D,C such that M = T/e/tAT^jg^i, L = SAT^ighti N = T/e/tAR, D = AR 
and C = SA. 

4. For {iqjjq, Xq) := (ii, ji, xi),..., {ii,ji,xi) (one pass over the stream of update operations) 

(a) For all f j 1, ^ 4 , let IVIj J IVI^J -t- (T/e/i)z,ig * Xq • right') 

(h) For all i — j — 1,..., ^ 4 , let ^ ~k ^i,iq * Xq * (T^ right )jg ,j • 

(c) For all f 1 ,...,^ 3 , j 1, ^ 2 ? l^t Nj ^ 4“ i^teft)i,ig ‘ Xq * Rjg J. 

(d) For all j = 1,..., ^ 2 , let ^ + Xq ■ Rj„,j 

(e) For all i — 1, let — Cj -t- • Xq 

5. end 

6 . ConstructX* = argmin ||N-X-L—M|||. (NoticethatX* = argmin ||Tie/i (ARXSA — A) Tright|||.) 

rank(x)<fe rank(x)<fe 

7. Compute the SVD of X* = Ux.Sx.V];;^ (Ux. G Rf^^ Ex. G Vx. G R^'^'^); then, compute 

T = DUx. 

K = C 

8 . Output AJ = TSx.K 

Theorem 53. With probability at least 0.96; 

l|4^ “ < (1 -I- O (e)) • ||A — Afc|||. 

The space complexity of the algorithm is O [{m + n)kle + poly(A:e“^)) words, and the running time for 
each update operation is 0{poly{ke~^)) and the total running time is of the order 

O ((• poly(A:e“^) + {m + n)k‘^e~^ + mnk) 

where I is total number of updates. 

Proof. Notice that = ARX*SA. According to Lemma 50 and Lemma 51, we have 

l|4^ “ < (1 + O (e)) • ||A — Afc|||. 

The total space of "sketch" matrices M, L, N, D is the same as the algorithm in Section 7.1.1. The 
maintenance of C needs additional 0{nke~^) space. 

The running time is almost the same as before. Computation of K = needs additional 

0{nk‘^e~^), and computation of = TEx,K needs additional 0{mnk). ■ 
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7.2 Two-pass streaming PCA with "real numbers" space complexity 

A simple modification of the batch algorithm in Section 5.2 leads to a two-pass streaming algo¬ 
rithm with 0{mk) + poly(/c/e) "real numbers" space complexity: 

Algorithm 

1. Construct two dense Johnson-Lindenstrauss matrices S € T € with = 0{ke~‘^), ^2 = 

0{ke~^) (see Lemma 27). 

2. Initialize all-zero matrices: A G X G R™^^. 

3. ¥01 {iq,jq,Xq) := (ii, ji, Xi),..., (i/, j/, x/) (first pass ovet the Stream) 

4. For all i — 1,..., , j — 1,..., ^25 let * Xq • Tjg. 

5. end 

6 . Compute the SVD of Afc = (U^, G R«l^^ G R'=><^ G Rf^^'^). 

7. For {iq,jq,Xq) := {ii,ji,xi ),..., {ii,ji,xi) (second pass over the stream) 

8 . For all z = 1,..., ^ 2 ) j = !> k, let Xq ■ • (V^^)jj-. 

9. end 

10. Compute an orthonormal basis U G R"*^^ for spanCK) (notice that rankCX.) < k). 

The theorem below analyzes the approximation error, the space complexity, and the running 
time of the previous algorithm. Notice that the space complexity of this algorithm is only given in 
terms of "real numbers" - we resolve this issue in the following section. 

Theorem 54. The matrix U G with k orthonormal columns satisfies w.p. 0.98; 

||A - UU^AIll <{1 + 0 (e)) • ||A - Afc|||. 

The space complexity of the algorithm is 

O {mk + poly(A:e“^)) 

"real numbers", the running time of each update operation is 0{poly{ke~^)), and the total running time is 
of the order 

O {l ■ poly(A:e“^) -|- mk'^) 
where I is the total number of update operations. 

Proof. The matrix U - up to the randomness in the algorithms - is exactly the same matrix as in the 
algorithm in Section 5.2, hence Theorem 29 shows the claim. 

To see the space complexity of the algorithm, observe that it suffices to maintain the matrices in 
the fourth and eighth steps of the algorithm. Furthermore, observe that by the end of the stream: 

A = SAT, 


and 

X = ATVi 

Afc 


Since ^ 1 , ^2 are poly{ke ^), the running time for each update operation is poly{ke ^). Comput¬ 
ing SVD in the sixth step needs poly(A:e“^). Computing U in the final step needs 0{mk‘^). ■ 


36 



7.3 Two-pass streaming PCA with bit space complexity 

The previous algorithm could suffer from large space complexity in case we need a lot of machine 
words to write down the entries of To fix this issue we use the same idea as in the case of the 
distributed PCA algorithm in Section 6. This leads to a two-pass streaming algorithm for PCA. 
Again, as in the distributed case, we need to test if the rank of A is less than 2k, and then we use 
one approach if rank(A) < 2k and another approach if rank(A) > 2k. Both of these approaches 
can be implemented with two passes. In the overall streaming PCA algorithm that we would like 
to design, we can not wait for the algorithm that tests the rank to finish and then start running 
one of the two PCA protocols, because this will lead to a three-pass algorithm (one pass to test 
the rank and two passes for the actual PCA protocol). To keep the number of passes to two, we 
just start running the PCA protocols in parallel with the protocol that tests the rank of the input 
matrix. In the end of the first pass over the stream of update operations, we already know which 
protocol to follow, and we just do this, disregarding the other protocol. 

We already discussed the streaming version of the algorithm that tests the rank of the matrix 
in Lemma 31. Below, we describe separately the two streaming PCA protocols. 

7.3.1 Streaming PCA protocol when rank( A) < 2k 

The idea here is to implement a streaming version of the distributed PCA protocol in Lemma 36. 
Algorithm 

1. Construct an n x 2fc matrix H" as in Lemma 36. 

2. Construct affine embedding matrices Tje/t G and Tright G with = 0(k/e^), ^2 = 

0{k/e^) (see Definition 33). 

3. Initialize all-zeros matrices: C G ]y[ ^ L g N g 

4. {iq,jq,Xq) := {ii, ji, Xi),{ii, ji, xi) (first pass over the stream) 

(a) For all j = 1,..., 2A:, let Ci, ^ + Xg ■ H" j 

5. end 

6 . For {iq,jq,Xq) := {ii,ji,xi ),..., {ii,ji,xi) (second pass over the stream) 

(a) For all i — j — 1 ? •■• 7 ^ 2 ? 1^1 AIi,J — T ■ {bright} jqj ‘ 

(b) For all i — 1,..., 2k, j — 1,..., ^ 2 , let Ljj' — T C i^iq * Xq • (fright')jq,j• 

(c) ForalH = J = 1,2fc, let -h iTieft)i,iq ■ Xq ■ H" 

7. end 

8 . Construct 

X*= argmin ||N • X • L - M||| := argmin \\TieftAIl"X{liyATright - TieftATrightWl- 
rank(x)<fc rank(x)<fc 

9. Compute an orthonormal basis U G for span(CX*). 
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7.3.2 Streaming PCA protocol when rank( A) > 2k 

The idea here is to implement a streaming version of step 3 of the algorithm in Section 6.4. 
Algorithm 

1. Construct two dense Johnson-Lindenstrauss matrices S € T € with = 0{ke~^),^2 = 

0{ke~‘^) (see Lemma 27). 

2. Initialize all-zero matrices: B G X G 

3. Forp := q ■= I,--,6, * := j := l,...,n 

4. Let rij = —Ijn^ with probability 1/2 and Vij = XjriP with probability 1/2. 

5. Bp g — Bp g -j- Sp^j * rj j ■ Tj g 

6 . end 

7. For (iqJqjXg) := (first pass over the stream) 

8 . For all 1 = 1,..., j = 1,..., lot B^ j = Bj^- -1- * Xq • 

9. end 

10. Compute the SVD of Bfc = (U^^ G R«i^'=, g R'=''^ G R^^^''). Then, we 

round each of the entries of to the nearest integer multiple of 1 /n'’' for a sufficient large constant 
7 > 0. Let the matrix after the rounding be . 

11. For {iq,jq,Xq) := {ii,ji,xi ),..., {ii,ji,xi) (second pass over the stream) 

12. For alH = 1,...,C2, j = 1,...,*, letX ,^7 = + Xq ■ 

13. end 

14. Compute an orthonormal basis U G R™^^ for span{%) (notice that rank{yi) < k). 

7.3.3 Main result 

The theorem below analyzes the approximation error, the space complexity, and the running time 
of the previous algorithm. Notice that the space complexity of this algorithm is given in terms of 
machine words. 

Theorem 55. The matrix U G with k orthonormal columns satisfies with arbitrarily large constant 

probability: 

||A - UU^AIll < (1 + e) • ||A - Afclll. 

The space complexity of the algorithm is 

O {mk + poly(A:e“^)) 

machine words, the running time of each update operation is 0{poly{ke~^)), and the total running time is 
of the order 

O [l ■ poly(fce“^) -|- mk^) 
where I is the total number of update operations. 

Proof. The matrix U - up to the randomness in the algorithms - is exactly the same matrix as in the 
algorithm in Theorem 43, hence the approximation bound in that theorem shows the claim. 

To see the space complexity of the algorithm, observe that it suffices to maintain all the "sketch" 
matrices in the both two protocols above. 

Since ^2 in both protocols are poly(A:e“^), the running time for each update operation is 
0(poly(A:e“^)). Additional 0{mk‘^) running time is caused by computing U in the final step. ■ 


38 



8 Distributed PCA for sparse matrices in column-partition model 

Recall that in the problem of Definition 15 we are given 1) an m x n matrix A partitioned column¬ 
wise as A = (Ai A 2 ... As) , with Aj G ]^rnxwi 2 ) ^ rank parameter k < rank(A); 

3) an accuracy parameter e > 0. We would like to design an algorithm that finds an m x /c matrix 
U with U^U = Ifc and, upon termination, leaves this matrix U in each machine of the network. 

The high level idea of our algorithm is to find, in a distributed way, a small set - 0{ke~^) - oi 
columns from A and then find U in the span of those columns. To choose those 0{ke~^) columns 
of A in a distributed way we implement the following three-stage sampling procedure: 

1. Local sampling: Each machine samples 0{k) columns using the sampling algorithm from [24]. 

2. Global sampling: The server collects the columns from each machine and down-samples 
them to 0{k) columns using the deterministic algorithm from [17]. 

3. Adaptive sampling: the server sends back to each machine those 0{k) columns; then, an 
extra of 0{k£~^) columns are selected randomly from the entire matrix A using [26]. 

We argue that if C contains the columns selected with this three-stage approach, then, w.p. 0.99, 

||A - CC^AIll < ||A - n?^^(A)||| < (1 + 0(e)) • ||A - Afc|||. 

Though we could have used TI? ^(A) to be the rank k matrix that approximates A, we are not 
familiar with any communication-efficient computation of II? ^(A). To address this issue, using 
an idea from [37], we compute U G span(C) suchthat || A —UU^A||| < (l-|-0(e))|| A—11? ^(A)|||; 
this U can be calculated with small communication cost and it is sufficient for our purposes. 

Before presenting the new algorithm in detail, we discuss results from previous literature that 
we employ in the analysis. 

8.1 Background material 

8.1.1 Column sampling algorithms 

Our distributed PCA algorithm in Section 8 samples columns from the input matrix in three stages: 

1. Local sampling: 0(k) columns are selected locally in each machine. 

2. Global sampling: 0{k) columns are selected in the server. 

3. Adaptive sampling: an extra 0{ke~^) columns are selected from the entire matrix A. 

In the first stage, we use a sampling algorithm mentioned in [24]. Actually, the sampling 
algorithm is the same as the Batson, Spielman, and Strivastava (BSS) spectral sparsification al¬ 
gorithm [11]. But Cohen et al. showed that a small set of columns sampled by BSS sampling 
algorithm can provide a projection-cost preserving sketch. In the second stage, we use a deter¬ 
ministic algorithm developed in [17], which itself extends the Batson, Spielman, and Strivastava 
(BSS) spectral sparsification algorithm [11]. For the actual algorithm in the first stage we defer 
the reader to Theorem 15 in [24]. Lemma 56 states the result. And for the actual algorithm in the 
second stage we defer the reader to Lemma 3.6 in [17]. Lemma 57 and Lemma 58 below present 
the relevant results. In the third sampling stage, we use an adaptive sampling algorithm from [26]. 
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Lemma 56 (BSS Sampling for Projection-Cost Preserving Sketch. Theorem 15 in [24]). Let A G 
arbitrary matrix. For any integer 0 < k < m, real numbers 0 < e < 1,0 < 6, there 
is a randomized algorithm that runs in 0{nnz{A)log{m/5) + m ■ poly(/c, e, log(l/(i))) time, and con¬ 
structs a "sampling/rescaling" w x Oikje^^ matrix S with probability at least 1 — 5 such that S is 
an {e, k)-proiection-cost preserving sketching matrix of A. We denote this sampling procedure as S = 
BssSamplingl {A,k,e,S). 

In words, there exists a randomized algorithm that runs in near input-sparsity running time 
can construct a projection-cost preserving sketch of a given matrix A by sampling and rescaling a 
small subset of columns of A. 


Lemma 57 (Dual Set Spectral-Frobenius Sparsification. Lemma 3.6 in [17]). Let V G ^ 

matrix with w > k and V^'^V = 1^. Let E G be an arbitrary matrix. Then, given an integer £ such 

that k < £ < w, there exists a deterministic algorithm that runs in O [£wk‘^ + mw) time, and constructs a 
"sampling/rescaling" w x £ matrix S such that 

4 (v^s) > (l - , IIESIII < ||E||2 . 


Specifically, rankfV'^S) = k. We denote this sampling procedure asS = BssSamplingl I (V, E, £). 


In words, given V and E, there exists an algorithm to select (and rescale) a small number of 
columns from E and rows from V such fhaf: 1) fhe Frobenius norm squared of fhe sampled E is 
less fhan the Frobenius norm squared of E; 2) the sampled V has rank equal to the rank of V; and 

3) the smallest singular value squared of the sampled V is bounded from below by ^1 — \/kj£^ . 


Lemma 58 (Constant factor column-based matrix reconstruction; Theorem 5 in [17]). Given ma¬ 
trix G G R”^^“ of rank p and a target rank k ^, there exists a deterministic algorithm that runs in 
O (amminja, m} -|- ack'^) time and selects ok columns ofG to form a matrix C G R”^^^ with 


|G - CC^GI 


F ^ 


1 + 1 - 




-2 


rank{G) 

E ^*(G). 




The algorithm in this theorem finds C as C = GS, where S = BssSamplingII{Y,G — GVV"'^,c) 
and V G R"^^ contains the top k right singular vectors of G. We denote this sampling procedure as 
C = DeterministicCssFrobenius{G,k, c). 

In words, there exists a deterministic algorithm, running in polynomial time, to select any 
number oi c > k columns from the given matrix G, such that the residual error, in Frobenius 
norm, from projecting G onto the span of the sampled columns is bounded from above with 
respect to the residual error of the best rank k matrix from the SVD of G. Notice that 

rank(G) 

E = 11^ “ 

i=k+l 


where G^ G R”^^“ has rank at most k and is computed via the SVD of G. 

^The original Theorem 5 in [17] has the assumption that k<p, but this assumption can be dropped having the result 
unchanged. The only reason the assumption k < p exists is because otherwise column subset selection is trivial. 
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Before presenting the adaptive sampling theorem from [26] we introduce some useful notation 
used in the lemma. Let A G k < n, and V G with k < c < n. 11^ ^(A) G is the 

best rank k approximation to A - wrt the Frobenius norm - in the column span of V. Equivalently, 

^^.(A) = VXopt, where 

Xopt = argmin || A - VX|||. 

xeR“’^",rank(x)<fc 

Lemma 59 (Adaptive sampling; Theorem 2.1 of [26]). Given A G and V G (with 

Cl < n, m), define the residual matrix 'J' = A — VV^ A G For j = 1,..., n, let pj be a probability 

distribution such that pj > I3\\'^^^f\2/\\^\\p, for some 1 > /3 > 0, where is the j-th column of the 
matrix Sample C 2 columns from A in C 2 i-i.d. trials, where in each trial the j-th column is chosen with 
probability pj. Let C 2 G R ’^^'^2 contain the C 2 sampled columns and let C = [V C 2 ] G R”^^(‘^i+‘= 2 ) 
contain the columns 0 /V and C 2 . Then, for any integer k > 0, 

rank(A) 

E[||A-CCtA|||] <E[||A-nF,,(A)|||] < ^ + * . ||a - VVtA|||. 

i=k+l ^ ^ 

Given A and C, this method requires 0{cimn) time to compute another 0{mn) time to compute 
the probabilities pj's and another 0{n + C 2 ) time for the sampling step - using the method in [51]. In 
total, the method requires 0{cimn + C 2 ) time to compute 02 - We denote this sampling procedure as 
C 2 = AdaptiveCols(A,\, C2,(3). 

In words, given the matrix A and the subspace V, there exists a randomized algorithm to 
sample an additional of C 2 columns from A, based on probabilities from the residual error matrix 
A — VV^ A, such that residual error, in Frobenius norm, from projecting A onto the span of the 
union of the columns of V and the sampled columns is bounded from above with respect to the 
best rank k approximation to A, the residual A — VV^ A, and the number of sampled columns C 2 . 

8.1.2 Distributed adaptive sampling 

In our distributed PCA algorithm below, we also need to use a distributed version of the previous 
adaptive sampling procedure. We provide some preliminary results for that task in this section. 

Lemma 60. Suppose that the columns of an mxn matrix A are partitioned arbitrarily across the machines 
into matrices Ai,..., A^. Let C be an arbitrary m x r matrix. Gonsider the distribution p onn columns 
in which 

lla,' — CC^aJp 
= ||A-CCtA|||’ 

where aj is the j-th column of A (j = 1 : n here). 

For each i G [s], let some value fSi satisfies 

||A, - CCtA,||2 < < 2||A, - CCtAilll. 

For each j G [n], if column aj is held on the i-th server (denoted a*), then let 

(ji ||aj-CCta*||2 

“ E'-iA' ' ||Ai-CCtAi||2- 

Then for each j G [n], 

Pj/2 < Qj < 
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Proof. By definition of the /?*, we have that 

||A,-CCtA,||| ^ pi ^ 2||A,-CCtA,||| 
2|| A - CCtA||2 - A' “ ||A - CCtA||2 

Hence, 

Vjl‘1 < Qj < 2pj. 


Lemma 61. Suppose the coordinator has an m x r matrix C of columns of an m x n matrix A, where 
r = 0{k). Suppose the entries of A are integers bounded by poly(mns/e) in magnitude, and let the 
columns of A be partitioned arbitrarily across the servers into matrices Ai,..., A^. 

There is a protocol using 0{skm) machine words of communication for the coordinator to learn values 
/3* so that for all i G [s], 

||A, - CCtA,||| <pi< 2||A, - CCtAilll. 

Proof. The coordinator first sends C to all machines. The i-th server locally computes ||Aj — 
CC^Ajllp. If this number is 0, he/she sends 0 to the coordinator, and in this case /3* satisfies 
the requirement in the statement of fhe lemma. 

Otherwise, consider the matrix Bj which is formed by concafenafing the columns of Aj with 
those of C. Then 


||Bi - CCtBilll = ||Ai - CCtAilll > 0, (9) 

since the columns of Bj in C contribute a cost of 0. However, since Bj contains the columns of C 
and ifs cosf is non-zero, if implies fhe rank of Bj is at least r -|- 1. By Corollary 38, this implies 

||Bi - CC^Billl > {mns/e)-^^^\ 

which by (9) gives the same lower bound on || Aj — CC^ Aj||p. Note also that 

||Ai - CC^A jIIf ^ poly(mns/e), 

since || A||| < poly(mns/e). This implies if fhe i-th machine sends /3* fo fhe coordinator, where /3* 
is fhe nearest power of 2 to ||A* — CC^A*|||, there will only be 0{klog{mns/e)) possible values 
of /3*, and hence each can be specified using 0(log k -|- log log(mns/e)) bits, i.e., a single machine 
word. This completes the proof. ■ 

8.1.3 Low-rank matrix approximations within a subspace 

The final stage of our distributed PCA algorithm below finds U G span{C) such that the error 
of fhe residual matrix A — UU^A is "small". To implement this step, we employ an algorithm 
developed in [37]. 

Lemma 62. Let A G be the input matrix and V G R™^"^ be the input subspace. We further assume 
that for some rank parameter k < c and accuracy parameter 0 < e < 1 : 

11-^ “ nY^fc(A)||| < (1 -I- 0(e))||A — Afc|||. 

Let V = Y'Tf be a qr decomposition ofY with Y G R™^"^ and G Let H = Y^^AW"*' G R^^^, 

where G R"-^^ with ^ = 0{cle^), each element of which is chosen i.i.d. to be {+l/^/n, —1/yTi} with 
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probability 1/2. Let A G 


contain the top k left singular vectors ofS. Then, with probability at least 


1 - e-''. 


||A - YAATyTaIII < (1 + e)||A - Afc|||. 
Y, and A can be computed in 0{mn^) time. VJe denote this procedure as 

[Y, A] = ApproxSubspaceSVD{A, V, /c, e). 


Proof. This result was essentially proven inside the proof of Theorem 1.1 in [36]. Specifically, the 
error bound proven in [36] is for the transpose of A (also Y, A are denoted with U,V in [36]). 
As for the running time, given A,V, and k, one can compute (i) Y in 0{mc^) time; (ii) 3 in 
0{mn^ + mc^) time; and (iii) A in 0(c^^) time. ■ 

In words, the lemma indicates that given the matrix A and the subspace V such that 

11"^ “ nv,fc(A)||| < (1 + 0(e))||A — Afclll, 

for some small e > 0, there exists a randomized algorithm to compute matrices Y G span(V) 
and A such that the residual error, in Frobenius norm, from projecting A onto the span of Y A is 
also bounded by (1 + 0(e))||A — Afc|||. Notice that A contains the top k left singular vectors of 
a "sketched" version of Y^A, a property which makes the computation particularly effective in 
terms of running time. 


8.2 Detailed description of the algorithm 
Input: 

1. A G partitioned column-wise A = (Ai A 2 ... A^) ; for i = 1 : s, A^ G Wi = n. 

2. rank parameter k < rank(A) 

3. accuracy parameter e > 0 

Algorithm 

1. Local Column Sampling 

(a) For each sub-matrix Ai G compute G containing i = 0{k) columns from 

Ai as follows: = A^S^. Here, has dimensions Wi x £ and is constructed as follows: 

Si = BssSamplingI{Ai,4k, 1/2, 0.01/s) (see Lemma 56). 

(b) Machine i sends Ci to the server. 

2. Global Column Sampling 

(a) Server constructs m x {s ■ £) matrix G containing (s • £) actual columns from A as follows: 

G = (Ci C 2 ... Cg) . Then, server constructs C G via choosing ci = 4fc columns 

from G as follows: C = DeterministicCssFrobenius{G, k, cf (see Lemma 58). 

(b) Server sends C to all the machines. 

3. Adaptive Column Sampling 

(a) Machine i computes \hi = Ai — CC^Ai G and then computes di as it was described in 

Lemma 61. Machine i sends Pi to server. 

(b) Server computes probability distribution gi = ■ Server samples i.i.d. with replacement 

[SO/c/e] samples (machines) from gi. Then, server determines numbers L (i = l,2,...,s), 
where ti is the number of times the fth machine was sampled. It sends the t/s to the machines. 
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(c) Machine i computes probabilities g] = ||x|||/||\I>i||| (j = 1 : Wi), where x is the jth column of 
’®'i. And now machine i samples h samples from it's local probability distribution and sends 
the corresponding columns to the server. Let C 2 = L = [ 50fc/e ]. 

(d) Server collects the columns and assigns them to C G Server constructs C to be the 

m X (ci + C 2 ) matrix: C = (C; C) . Let 0 = 01+02=4^+1"SOfc/e]. 

4. Rank-fc matrix in the span of C 

(a) Server sends C to all the machines and each machine computes (the same) qr factorization of 
C: C = YR where Y G ]^”ixc orthonormal columns and R G is upper triangular. 

(b) Machine i generates G with ^ = 0{c/e'^) to be i.i.d. {+1,-1} w.p. 1/2. Implicitly 

all machines together generate W = (Wi W 2 ... Wg ) , with W G R^^". Machine i 

~ T T- 

computes Hj = C (A^W/) G R'^^''. Machine i sends to the server. 

(c) Server computes H = Xi=i Hi G R°^^ and sends this back to all the machines. Now machines 

compute H := R“^ • H • l/Vn{:= Y^AW^), where W = (Wi W 2 ... Wg ) G R^""” is 
random matrix each element of which is {+l/^/ri, w.p. 1/2, and fhen fhey compufe 

A G R°^^ to be the top k left singular vectors of H. Each machine compufes U = Y • A G 


Discussion. A few remarks are necessary for the last two stages of the algorithm. The third stage 
(adaptive column sampling), implements the adaptive sampling method of Lemma 59. To see this, 
note that each column in A (specifically the jth column in A,) is sampled with probability 


gi ■ Qj > 


1 

2 ' 


x:;- 




l|2 

II 2 
TIT ’ 
IIf 


where = A — CC^ A. This follows from Lemma 60. Overall, this stage effectively constructs C 

such that (see Lemma 59) 

C = AdaptiveCols{A, C, C 2 ,1/2). 

The last stage in the algorithm implements the algorithm in Lemma 62. To see this, note that 
W satisfies the properties in the lemma. Hence, 

[Y, A] = ApproxSubspaceSVD{A, C, k,s). 


8.3 Main Result 

The theorem below analyzes the approximation error, the communication complexity, and the 
rurming time of the previous algorithm. 

Theorem 63. The matrix C G with c = 0{k/e) columns of A satisfies w.p. at least 0.98, 

||A - C&A\\l < ||A - n?^^(A)||2 < (1 + Ois)) • ||A - UfcU^AIll. (10) 

The matrix U G xvith k orthonormal columns satisfies with probability at least 0.97, 

IIA - UU'^AIll < (1 + 0(e)) • IIA - UfcU^A|||. (11) 

Let each column of A has at most f non-zero elements. Then, the communication cost of the algorithm is 
O [sk(j)e~^ + and the running time is O {mns ■ poly{k, 1/e)). 
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8.4 Proof of Theorem 63 
8.4.1 Proof of Eqn. 10 

From Lemma 59: 


E 


|A- CC'A 


< E 


|A-nyA)|| 


< 


rank(A) 

E 

i=k-\-l 


(J-. 


’(A) + 


2e 


\A-CC^A\\l. 


( 12 ) 


Therefore, we need to bound || A — CC^A||p now. 

Lemma 64. With probability at least 0.99, matrix G is a (4/c, 1 /2)-projection-cost preserving sketch of A. 

Proof. For each i, C, = AjSj is a {4k, l/2)-projection-cost preserving sketch of A, with probability 
at least 1 — 0.01/s. Due to union bound, with probability at least 1 — 0.01, Vi, C* is a {4k, 1/2)- 
projection-cost preserving sketch of A*. Since C* is a {4k, l/2)-projection-cost preserving sketch of 
Ai, we can assume c, is a constant which is independent from projection matrix P which has rank 
at most 4k such that 

1 3 

-||Aj - PAilll < ||Ci - PCilll + Ci < -||Aj - PAjlll 

Then, for any projection matrix P which has rank at most 4k we have: 


1 * 

A - PA||| = - ^ ||Ai - PAilll 

i=l 

s 

<^(||Ci-PCi||| + Ci) 
i=l 


IIA,-PA, 

i=l 

<-||A-PA||2 

“ 2 " 


Consider the right hand side of the first inequality, it is equal to 


G - PG||| + 

i=l 


Let constant c = Q, we have: 

i||A - PA||| < ||G - PGIll + c < ^||A - PA||| 

Therefore, G is a {4k, l/2)-projection-cost preserving sketch of A. ■ 

For convenience, we use Pq to denote a rank-/c projection matrix which provides the best 
rank-A: approximation G^ = PqG to matrix G. We also use Pa to denote a rank-/c projection 
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matrix which provides the best rank-/c approximation = Pa A to matrix A. Thus, 


A - CC^AIll < ||G - CC^GIll + c 

< 5||G-PgG||| + c 

< 5||G — PaG||p + c 
— 5 (||G — PaG||p + c) 

<^||A-PaA||J 


The first inequality is due to Lemma 64. G is a {Ak, l/2)-projection cost preserving sketch of A 
and C has rank at most Ak. The second inequality is based on Lemma 58. The third inequality is 
because Pg provides the best rank-Zc approximation to G. The fourth inequality used the fact c is 
non-negative. The fifth inequality held because G is a {Ak, l/2)-projection cost preserving sketch 
of A. Thus, 

IIA-CC^AIll < 15||A-Afclll 
Combining with equation 12, we have: 


E 


lA-CC'A 


< E 


|A-n^,fc(A)|| 


^ 30 

< (IH-e 

- ^ 50 


rank(A) 

E 

2=fc+l 


a: 


'(A) 


(l+0(e))||A—Afclll (13) 


Concluding the proof. Consider about 


E 


|A- CC^AIll 


< E 


rank(A) rank(A) 

|A-n|_^(A)|||] < u2(A) + 0(e) u2(A). (14) 


i=k+l 


i=k+l 


The expectation is taken w.r.t. the randomness in constructing C; hence, the term ^KA) 

is a constant w.r.t. the expectation operation, which means that Eqn. 14 implies the bound: 


E 


rank{A) 

|A-nJ^JA)||2 - a‘f{A) 


i=k+l 


rank(A) 

< 0{e) • ^ ^^^(A). 

2=fc+l 


Let Y := ||A — 11? ^(A)||| — e'?(A). E is a random variable with E > 0, because 


rank(n? ^(A)) < k. Markov's inequality on E implies that with arbitrary constant probability. 


rank(A) rank(A) 

|A-n?^^(A)|||- ^ af{A)<0{e)- ^^(A), 


i=k-\-l 


i=k-\-l 


equivalently. 


rank(A) rank(A) 

|A-n?^^(A)|||< af{A) + 0{e)- ^^(A). 


i=k+l 


i=k-\-l 


Using IIA — CC'A||p < ||A — H? ^(A)||p concludes the proof. 
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8.4.2 Proof of Eqn. 11 

We would like to apply Lemma 62 for the matrix U G algorithm. Note that we 

have already proved that the matrix C in the algorithm satisfies with probability at least 0.99: 
l|A-nS,(A)||| < (1 + 0(e))||A — Afcllp. Lemma 62 and a simple union bound conclude the 
proof. 

8.4.3 Running time 

Next, we give an analysis of the arithmetic operations required in various steps in the algorithm. 

1. Local Column Sampling: 0(nnz(A) log(ms) + m ■ poly(A:, e, log(s))) arithmetic operations in 
total. 

(a) 0(nnz(Aj) log(ms) + m ■ poly(A:, e, log(s))) for each 

(b) - 

2. Global Column Sampling: 0{sk^ + skm) arithmetic operations in total. 

(a) 0{sk^ + skm) to construct C. 

(b) - 

3. Adaptive Sampling 0{sk‘^m + mnk + k/e) arithmetic operations in total. 

(a) 0{k‘^m) for each locally; 0{wimk) for each and 0{mwi) for each -ipi. 

(b) 0(s) in total. 

(c) 0{k/£) in total using the method of [51] locally. 

(d) 0{k/e). 

4. Rank-/c matrix in span{C): 0{mnk/£^ + smk ^+ sk^ je^) arithmetic operations in total. 

(a) in total 

(b) 0{mwikj£^) for each AjWj and another 0{mk‘^e~*) for each C^(AjWj) 

(c) 0{sk‘^£~^) to find H in the server; then another 0{sk^£~^ + sk^£~^)) to update H lo¬ 
cally in each machine and another 0{sk^£~^) to find A locally in each machine; and 
0{smk‘^/e) to find U locally in each machine. 

8.4.4 Communication Complexity 

Assume that we can represent each element in the input matrix A with at most b bits. Also, we 
assume that one word has length b bits and b = 0(log(mns/e)). 

1. Local Column Sampling: 0{sk(l>) words. 

(a) - 

(b) 0{sk(l)) elements of A. 

2. Global Column Sampling:0(s/c(^) words. 

(a) - 


47 



(b) 0{sk(j)) elements of A. 

3. Adaptive Sampling: 0{s + (jik/e) words. 

(a) s integers each of which is representable with 0(log /c+log log(mns/e)) (from Lemma 61). 

(b) s integers (the i/s) each with magnitude at most n, hence representable with b bits. 

(c) 0((/)A:/e) elements of A. 

(d) - 

4. Best rank-A: matrix in the span of C: 0{s^k/e + sk‘^e~'^) words. 

(a) 0{sk(j)£~^) elements of A. 

(b) 0{sk‘^£~'^) numbers each of which can be represented with b bits. 

(c) 0{sk‘^£~‘^) numbers each of which can be represented with b bits. 

In total the communication complexity is 0{sk^l£ + sk^l£‘^) words. 

9 Faster Distributed PCA for sparse matrices 

We now explain how to modify certain steps of the distributed PCA algorithm in Section 8 in order 
to improve the total running time spent to compute the matrix U; this, at the cost of increasing the 
communication cost slightly. Specifically, we replace the parts 2-(a), 3-(a), and 4-(6) with similar 
procedures, which are almost as accurate as the original procedures but run in time proportional 
to the number of the non-zero elements of fhe underlying mafrices. Before presenting the new 
algorithm in detail, we discuss results from previous literature that we employ in the analysis. We 
remark that in the analysis below we have not attempted to optimize the constants. 

9.1 Background material 

9.1.1 Constant probability sparse subspace embeddings and sparse SVD 

First, we present the so-called sparse subspace embedding matrices of Clarskon and Woodruff [23]. 

Definition 65. [Sparse Subspace Embedding [23]] We call W G a sparse subspace embedding of 
dimension ^ < n if it is constructed as follows, W = '1' • Y, where 

• h : [n] ^ [^] is a random map so that for each i G [n], h{i) = ^',for G [^] w.p. 1/^. 

is a binary matrix with all remaining entries 0. 

• Y G is a random diagonal matrix, with each diagonal entry independently chosen to be 4-1 or 
— 1, with equal probability. 

Such sparse subspace embedding matrices have favorable properties which we summarize 
below: 

Lemma 66 ([23]). Let G have rank p and let W G R^^*^ be a randomly chosen sparse subspace 
embedding with dimension ^ = EL{p^£~‘^), for some 0 < e < 1. Then, 1) computing AW"*^ requires 
0(nnz(A)) time; 2) nnz(AW"'^) < nnz(A); and 3) with probability at least 0.99, and for all vectors 
y G simultaneously, 

(1 - e)||AVlli < llWAVlli < (1 + e)||AVlli- 
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Notice that the third property in the lemma fails with constant probability. Based on the above 
sparse subspace embeddings, Clarkson and Woodruff [23] described a low-rank matrix approxi¬ 
mation algorithm for A that runs in time nnz(A) plus low-order terms. 

Lemma 67 (Theorem 47 in [23]; for the exact statement see Lemma 3.4 in [19]). Given A G ]^mxn 
of rank p, a target rank 1 < k < p, and 0 < e < 1, there exists a randomized algorithm that computes 
Z G with Z^Z = Ifc and with probability at least 0.99, 

IIA - AZZ^Ill < (1 -h e) IIA - Afc|||. 

The proposed algorithm requires O (nnz(A) + {m + n) ■ poly{k,e~^)) arithmetic operations. We denote 
this procedure asZ = SparseSVD{A, k, e). 

Notice that the algorithm in the lemma fails with constant probability. 

9.1.2 High probability sparse SVD 

Lemma 67 above presents a randomized algorithm to compute quickly an SVD of a sparse matrix; 
however, this lemma fails with constant probability. Since in our fast distributed PCA algorithm 
we will use the lemma multiple times (specifically, s times, since we need such a fast SVD locally 
in each machine), the overall failure probability would be of fhe order 0{s) (by a union bound). 
To boosf this failure probability, we develop a simple scheme to obtain a high probability result. 

Before we proceed, we quickly review the Johnson Lindestrauss transform, which we need in 
the analysis. 

Lemma 68. [Theorem 1 in [5] for fixed e = 1/2] Let B G R™^”. Given /3 > 0,let r = 

Gonstruct a matrix S G each element of which is a random variable which takes values ±l/y/rwith 

equal probability. Let B = SB. Then, if hi and b, denote the ith column o/B and B, respectively, with 
probability at least 1 —n“^, and for all i = 1, ...n, (1 — ^)||bi ||2 < UbiHi < (l + ^)l|t>i|| 2 - Given B, it takes 
0(nnz(B) log n) arithmetic operations to construct B. We will denote this procedure as B = JLT(B, fd). 

Now, we are fully equipped to design the "high-probability" analog of Lemma 67: 

Lemma 69. Given A G of rank p, a target rank 1 < k < p, and 0 < 5, e < 1, there exists a 

randomized algorithm that computes Z G R”^^ with Z^'^Z = 1^ and with probability at least 1 — 5 — 1/n, 
||A - AZZ^III < 3 (1 -|- e) IIA — Afc|||. The proposed algorithm requires O (nnz(A) log^(|;)) -|- n ■ 
poly{k, e, log(^)) arithmetic operations. We denote this procedure asZ = SparseSVDBoosting{A, k, e, S) 

Proof. The idea is to use the algorithm in Lemma 67 and generate i.i.d. r = 0(log(|)) matrices 
Zi, Z 2 ,..., Zr, where Zj = SparseSVD{A, k, e), and || A — AZjZj||| < (1 -t- £)|| A — Afc|||, with 
probability at least 0.99 for a single Zj. 

Now, we could have chosen Z := Z* which minimizes || A — AZiZj |||, and this would have 
implied that ||A — AZZ^||| < (1 -|- e)|| A — Afc|||, with probability 1 — (5 (via a simple chernoff 
argument), but evaluating the errors ||A — AZjZ^||| is computationally intensive for the overall 
rurming time that we would like to obtain. 

Selection of Z. Instead, we use the Johnson Lindestrauss transform to speedup this computation 
(evaluating the errors). Hence, for every Zi, we (implicitly) set B^ := A — AZ^Zj and then we 
compute Bj = JLT(Bj, 1) (see Lemma 68). Now, we choose the Zj that minimizes ||Bj||p. 
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Correctness. For a fixed Z,, from Lemma 68 it follows that with probability 1 — 1/n it is: ^||A — 
AZiZjWp < ||Bj||| < |||A — AZjZj|||, where the failure probabilityisovertherandomnessof Sin 
Lemma 68. Now, for some Z* with probability at least 0.99 it is || A — AZj Zjlll <(i + ^)||A-A,|||, 
where the failure probability is over the randomness in constructing Z*. Overall, given that n > 1, 
with probability at least 0.49 (over the randomness of bofh S and Z*; the failure probability follows 
via a union bound) for each single Zj it is: ||Bi||| < |(1 + e)|| A — Afc|||. Hence, since r = 0(log j) 
and we have picked Z as the Z, that minimizes ||Bj|||, it follows fhaf with probability at least 1 — (5, 
||Bj||| < |(l + e)||A — Afc|||. Combining with ^IIA — AZjZjlIp < ||Bj|||, and using a union bound, 
it follows that with probability I — 1/n — 6: ||A — AZjZ^||| < |(1 + e)|| A — Afc|||. 

Running time. The following costs occur 0(log( j)) times: (i) O (nnz(A) + (m + n) • poly{k, 
to find Zj (via Lemma 67); (ii) 0(nnz(A) log n + nk log(n)) to evaluate each cost, i.e., find ||Bi|||. 


9.1.3 Fast column sampling techniques 

Section 8.1.1 presented the column sampling algorithms that we used in our distributed PCA 
algorithm in Section 8. Next, we present the "fast" analogs of those algorithms. To design such 
fast analogs we employ the sparse subspace embedding matrices in the previous section. All the 
results in this section developed in [19]. There is a small difference in Lemma 70 below, hence we 
present a short proof for completeness. Specifically, the original result in [19] has a constant failure 
probability; here, via using standard arguments, we extended this result to a high probability 
bound. 

Lemma 70 (Input-Sparsity-Time Dual-Set Spectral-Frobenius Sparsification [19].). Let V G 
be a matrix with w > k and V^V = 1^. Let E G arbitrary matrix. Let S be a failure 

probability parameter. 

For i = 1,2,. .., r = 0(log j), let Bj = EW^ G where Wj G is a randomly chosen 

sparse subspace embedding with ^ = 0{k‘^/e^) < w,for some 0 < e < 1 (see Definition 65 and Lemma 66), 
and run the algorithm of Lemma 57 with V, Bj, and some i with k < i < w, to construct Si G 
Choose some Sfrom the Sfs - see the proof for the details. 

Then, with probability at least 1 — 5, al (V"^S) > (^1 — y/kftj and ||ES||| < • l|E||p. 

The matrix S can be computed in O (log (^) • (nnz(A) -|- iwk"^ -|- mkf /e^ -|- m^)) time. INe denote this 
procedure as S = BssSamplingSparse(V, A, r, e, 6). 

Proof. The algorithm constructs the Si's as follows. Si = BssSampling(V, Bi, r) - see Lemma 57. 

From that lemma, we also have that (V^Si) > (^1 — \/k/iJ and ||bJ^S||| < ||bJ^|||, i.e., 

||WiA^Si||| < ||WiA^|||. Since Wi is a subspace embedding, from Lemma 66 we have that with 
probability at least 0.99 and for all vecfors y G R” simulfaneously, (1 — ^)l|AVlli< llWiAVlIi- 
Apply this r times for y G R” being columns from Si G and fake a sum on the resulting 

inequalities: (1 — ^)I|aTs,||| < ||WA^Si|||; also, apply this n times for fhe basis vectors in R” 
and take a sum on the resulting inequalities: ||WA^||| < (1 -|-e) ||A^||p. Combining all these 
inequalities together, we conclude that with probability at least 0.99, ||A^Si||| < ■ ||A^|||. 

To summarize, for each Si and with probability at least 0.99, a/, (V^Si ) > ^1 — \Jkli^ and 
l|ES,||| < ■ ||E|||. 
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Selection of S. Now that we have r matrices each of which satisfying the above bounds with 
probability 0.99 we select one of them as follows: 1) we compute all al (V^S*) and sort them from 
the largest to the smallest value; 2) we compute all ||A^Sj||| and sort them from the smallest to 
the largest value; 3) we choose an S* which occurs in the top 2/3 fraction of each of the lists. 

Correctness. Let Xi be an indicator random variable: Xj = 1 if the f th matrix Si satisfies (v'^Si) > 

^1 — \/klij , and Xi = 0, otherwise. Then, Pr[Xj] > 0.99. Let X = Xi. The expected value of 
X is = IE [X] = 0.99r. A standard Chernoff bound applied on X gives: Pr[X < (1 — a)fix] < 
e Let (1 — Q;)0.99r = 0.75r, i.e., a = 0.14/0.99. Applying this to the Chernoff bound, we get: 

Pr[X < 0.75r] < Replacing r = 0(log(l/(i)): Pr[X < 0.75r] < 5/2. This means precisely 

that 3/4 of the fraction of the S/s satisfy with probability 1 — 5/2 : cr| (V^S*) > ^1 — \/k/i^ . 
Similarly, let Yi be an indicator random variable: 1/ = 1 if the ith matrix S* satisfies ||ESj||| < 

■ l|E|||, and Yi = 0, otherwise. Then, Pr[yi] > 0.99. Let Y = Y/l The expected value of 
F is ;Uy = E [Y] = 0.99r. A standard Chernoff bound applied on Y gives: Pr[y < (1 — a)/Uy] < 

fiyoc 

e ^. Let (1 — Q;)0.99r = 0.75r, i.e., a = 0.14/0.99. Applying this to the Chernoff bound, we get: 
Pr[y < 0.75r] < Replacing r = 0(log(l/(5)): Pr[y < 0.75r] < 5/2. This means precisely 

that 3/4 of the fraction of the Sj's satisfy w.p. 1 — 5/2 : ||ESi||| < • l|E|||. 

Now a simple union bound indicates that with probability at least 1 — 5, the following two 
events happen simultaneously: X > 0.75r and Y > 0.75r. Since we select an element S* on top of 
the two sorted lists, it follows that this element should have Xj = 1/ = 1. 

Running time. The following costs occur 0(log( j)) times: (i) 0(nnz(A)) to find B*. (i) 0{£wk‘^ + 
m^) to find S* (via Lemma 57); and (iii) 0{k‘^i + m£) to evaluate each cost and, i.e., find a/. (V^Sj) 
and ||ESi|||. Additionally, 0(log(|) log log( j)) time in total is spent in sorting. ■ 

The following Lemma is the "fast" analog of Lemma 58. The failure probability in the lemma is 
constant, but this is sufficient for our purposes since we apply this lemma only once in the global 
column sampling step of the distributed PCA algorithm in Section 9.2. 

Lemma 71 (Input-Sparsity-Time constant factor column-based matrix reconstruction; Lemma 6.3 
in [19]). Given matrix G G of rank p and a target rank k there exists a randomized algorithm 

that runs in O (nnz(A) • log a -|- m • poly{log a, k, e~^)) time and selects c = Ak columns ofG to form a 
matrix C G such that with probability at least 0.69: 


rank(G) 

||G - CCiGIll < 4820 • 

i=k+l 

We denote this procedure as C = DeterministicCssFrobeniusSparse{G, k, c). 

Finally, the lemma below presents the "fast" analog of Lemma 59. The failure probability in 
the lemma is, again, constant; we have not attempted to obtain a high probability bound since we 
employ this lemma only once in the algorithm in Section 9.2. 

^The original Lemma 6.3 in [19] has the assumption that k < p, but this assumption can be dropped having the 
result unchanged. The only reason the assumption k < p exists is because otherwise column subset selection is trivial. 
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Lemma 72 (Input-sparsity-time Adaptive Sampling [19]). Given A G ^nd V G 

Cl < n, m), de/ine the residual matrix = A — VV^A G Let # = JLT(^, 1). For j = 1,..., n, 

let pj be a probability distribution such that, for some positive constant /3 < 1, pj > |||, 

where is the j-th column of the matrix Sample C 2 columns from A in C 2 i-i.d. trials, where in each 
trial the j-th column is chosen with probability pi. Let C 2 G contain the C 2 sampled columns and 

let C = [V C 2 ] G R™x(ci+c 2 ) contain the columns 0 /V and C 2 . Then, for any integer k > 0, and with 
probability 0.9 — ^ 

IIA - ng,,(A)||| < ||A - A,||2 +^\\A- VVtAlll. 

P ■ C2 

Given A and V, the algorithm takes 0(nnz(A) log n + mci log n + mcf) time to find C 2 . We denote this 
sampling procedure as C 2 = AdaptiveColsSparse{A, V, C 2 , /3). 

9.1.4 Fast Low-rank matrix approximations within a subspace 

Finally, we present the fast analog of the result in Section 8.1.3. The failure probability in the 
lemma is, again, constant; we have not attempted to obtain a high probability bound since we 
employ this lemma only once in the algorithm in Section 9.2. 

Lemma 73. Let A G be the input matrix and V G R™^"^ be the input subspace. We further assume 
that for some rank parameter k < c and accuracy parameter 0 < e < 1 : 

11"^ “ nv^fc(A)||| < (1 -I- 0(e))||A — Afc|||. 

Let V = YAf be a qr decomposition of V with Y G and ^ G Let H = Y^^AW^ G 

where G R""^^ with ^ = 0{c/e^), is a sparse subspace embedding matrix (see Definition 65 and 
Lemma 66). Let A G R^^^ contain the top k left singular vectors o/H. Then, with probability at least 0.99, 

||A-YAATyTa|| 2 < (l + e)||A-Afc||2. 

Y and A can be computed in 0(nnz(A) -|- mc^) time. We denote this procedure as 

[Y, A] = ApproxSubspaceSVDSparse(A, V, k, s). 

Proof. This result was proven inside the proof of Theorem 1.5 in [36]. Specifically, fhe error bound 
proven in [36] is for the transpose of A (also Y, A are denoted with U, V in [36]). The only require¬ 
ment for the embedding matrix W (denoted with P in the proof of Theorem 1.5 in [36]) is to be a 
subspace embedding for Y^ A, in the sense that W is a subspace embedding for A in Lemma 66. 
Since our choice of W with ^ = 0((? je^') satisfies fhis requirement we omit the details of the proof. 
The running time is 0(nnz(A) -|-mc^): one can compute (i) Y in O(mc^); (ii) H in 0(nnz(A) -|-mc^); 
and (iii) A in 0(c(^ min{c, ^}). The failure probability 0.01 from Lemma 66. ■ 

9.2 Detailed description of the algorithm 

This algorithm is very similar to the algorithm in Section 8.2; we only replace the parts 2-(a), 3-(a), 
and 4-(6) with faster procedures. 

Input: 

1. A G partitioned column-wise A = (Ai A 2 ... A^); for i = 1 : s, A^ G 

2. rank parameter k < rank(A) 
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3. accuracy parameter e > 0 

4. failure probability 6 

Algorithm 

1. Local Column Sampling 

(a) For each sub-matrix compute containing £ = 0(k) columns from 

Ai as follows: Ci = AiSi. Here, has dimensions Wi x £ and is constructed as follows: 
Si = BssSamplingI{Ai, 4fc, 1/2,0.01/s) (see Lemma 56). 

(b) Machine i sends Ci to the server. 

2. Global Column Sampling 

(a) Server constructs m x {s ■ £) matrix G containing (s • £) actual columns from A as follows: 

G = (Cl C 2 ... Cs) ■ Then, server constructs C € via choosing ci = 4k columns 

from G as follows: C = DeterministicCssFrobeniusSparse{G, k, ci) (see Lemma 71). 

(b) Server sends C to all the machines. 

3. Adaptive Column Sampling 

(a) Server initializes the random seed and communicates it to all the machines. Each machine con¬ 
structs the same S G with r = ^i/ 2 )al(i/ 2 )a log each element of which is a random vari¬ 

able which takes values -FXj^Jr with equal probability. Machine i finds ’4'i = SAi — SCCIA^ g 

then computes fii as it was described in Lemma 61. Machine i sends di to server. 

(b) Server computes probability distribution gi = . Server samples i.i.d. with replacement 

[SOlc/s] samples (machines) from gi. Then, server determines numbers ti (i = l,2,...,s), 
where ti is the number of times the fth machine was sampled. It sends the t/s to the machines. 

(c) Machine i computes probabilities g] = ||x|||/||’®'i|lF 0 = 1- ^0/ where x is the jth column of 

^ 1 . And now machine i samples ti samples from it's local probability distribution and sends 
the corresponding columns to the server. Let C 2 = = \ 50fc/e ]. 

(d) Server collects the columns and assigns them to C G Server constructs C to be the 

m X (ci -I- C 2 ) matrix: C = (C; C) . Let c = ci -|- C 2 = 4A: -I- [SOfc/e]. 

4. Rank-fc matrix in the span of C 

(a) Server sends C to all the machines and each machine computes (the same) qr factorization of 
C: C = YR where Y G has orthonormal columns and R G is upper triangular. 

(b) Server initializes the random seed and sends this to each machine, such that each machine 

generates the same matrix \I/ G for ^ = 0(c^/e^), (see Definition 65). Machine i generates 

Di G R”^™* such that implicitly D = (Di D 2 ... D^) is an n X n matrix each element of 

which is ±1, with probability 1/2. Each machine constructs implicitly G R^^^'G 

Implicitly all machines together generate W = (Wi W 2 ... Wg) , with W G R^^" and W 

~ T -r 

is a subspace embedding as in Definition 65. Machine i computes = C (A^W/) G R'^^'*. 
Machine i sends to the server. 

(c) Server computes H = Xi=i Hi G R'^^^ and sends this back to all the machines. Now machines 
compute H := R~^ • H(:= Y^A^W/), and then they compute A G R°^^ to be the top k left 
singular vectors of H. Each machine computes U = Y • A G 
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Discussion. A few remarks are necessary for the last two stages of the algorithm. The third stage 
(adaptive column sampling), implements the adaptive sampling method of Lemma 72. To see this 
note that each column in A is sampled with probability 

||x||i/||^|||, 

where x is this column in This follows from Lemma 60. Overall, this stage constructs C such 
that C = AdaptiveColsSparse{A, C,C2,1/2). 

The last stage in the algorithm implements the algorithm in Lemma 73. To see this, note that 
W satisfies the properties in the lemma. Hence, 

[Y, A] = ApproxSubspaceSVDSparse{A, C, k, s). 


9.3 Main result 

The theorem below analyzes the approximation error, the communication complexity, and the 
rurming time of the previous algorithm. 

Theorem 74. The matrix C G with c = 0{k/e) columns of A satisfies w.p. 0.59 — — 26: 

( rank(A) \ 

E • (15) 

i=k+l j 

The matrix U G with k orthonormal columns satisfies w.p. 0.58 — — 26: 


A-UU^AIll < (l + 0(e)) • 


^ rank(A) 

E 




(16) 


Let each column of A has at most f non-zero elements. Then, the communication cost of the algorithm is 
O [sk(f)e~^ + sk^e~^) and the running time is 

O ^nnz(A) • log^ +{m + n)s ■ poly{k,e~^, log • 


9.4 Proof of Theorem 74 
9.4.1 Proof of Eqn. 15 

From Lemma 72, we have that with probability at least 0.9 — 

rank(A) 

||A-CclA|||<||A-nE(A)|||< E ^'(A) + ^-||A-CCtA|||. (17) 

i=k-\-l 

According to Lemma 64, G is a {Ak, l/2)-projection-cost preserving sketch of A. It means that there 
exists a constant c which is independent from projection matrix P and we have the following: 

i||A - PAIll < ||G - PGIll + c < ^||A - PA||2 
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For convenience, we use Pq to denote a rank-/c projection matrix which provides the best rank-/c 
approximation Gk = PqG to matrix G. We also use Pa to denote a rank-A: projection matrix 
which provides the best rank-/c approximation = Pa A to matrix A. With probability at least 
0.68, the following will be held. 


A - CC^AIll < ||G - CC^GIll + c 


< 

< 

< 

< 


40ZU||' 


4820||G -PaGIII 

4820 (||G-PaG|| 

4820 ,,. „ .,,2 

^||A-PaA||2 


-h c 


C 

F + c) 


The first inequality is due to Lemma 64. G is a (4A:, l/2)-projection cost preserving sketch of A 
and C has rank at most 4/c. The second inequality is based on Lemma 71. The third inequality is 
because Pg provides the best rank-Zc approximation to G. The fourth inequality used the fact c is 
non-negative. The fifth inequality held because G is a (4/c, l/2)-projection cost preserving sketch 
of A. Thus, 

||A - CC^AIll < 4820||A - Afc||| 

Concluding the proof. Combining with Equation 17, we have: 

rank(A) rank{A) 

||A-CcU|| 2 <||A-nJ_^(A)||2 < u2(A) + 0(£). u2(A). (18) 

i=k-\-l i=k+l 

9.4.2 Proof of Eqn. 16 

We would like fo apply Lemma 73 for fhe matrix U G in the algorithm. Note that we already 
proved that the matrix C in the algorithm satisfies with probability 0.59 — — 25: 

l|A — n^^^(A)||| < (1 -I- 0(£))||A — Afcllp. 

Lemma 73 and a simple union bound conclude the proof. 

9.4.3 Running time 

1. Local Column Sampling: 0(nnz(A) log(ms) -|- m ■ poly(/c, e, log(s))) arithmetic operations in 
total. 

(a) 0(nnz(Aj) log(ms) -|- m ■ poly(A:, e, log(s))) for each 

(b) - 

2. Global Column Sampling: O (nnz(A) -logk + m ■ poly(/c,e“^)). 

(a) O (nnz(A) ■ \ogk + m ■ poly{k, to construct C - from Lemma 71. 

(b) - 
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3. Adaptive Sampling: 0(nnz(A) • logn + m • s • poly(/c,e ^,logn)). 

(a) First, we analyze the cost to compute the matrix 0{k‘^m) for each locally; then, 
we compute (SAj) — (((SCi)C|)Aj). The costs are: SA* := D takes 0(nnz(A,)) time, 
SCi := G takes 0(nnz(A,) log(n)) time, GC| := H takes 0{m\og{n)k/e^) time, HA := 
L takes 0(nnz(Ai) log(n)) time, and D — L takes 0(n log n) time. So, the total time to 
compute one is 0(nnz(Aj) logn + m ■ poly{k, logn)). 

Also, there is a cost of computing which is 0(log n • Wi) arithmetic operations. 

(b) 0(s) in total. 

(c) 0{k/e) in total using the method of [51] locally. 

(d) 0{k/e). 

4. Rank-/c matrix in span{C): 0(nnz(A) + m ■ s ■ poly{k, e“^)) arithmetic operations in total. 

(a) 0{smk^je^) in total. 

(b) 0(sn) in total to generate s times the same '4'. Then another 0(n) in total to generate 
the Dj's. For all A,Wj we need 0(nnz(A)) operations and then another 0{mk^e~’^) for 
each • (A^Wj) 

(c) 0{sk^e~^) to find H; then another 0{sk^e~^ + sk‘^e~^)) to update H locally in each 
machine and another 0(sA:^e“®) to find A; and 0{smk‘^/e) to find U (s times). 

9.4.4 Communication Complexity 

Assume that we can represent each element in the input matrix A with at most b bits. Also, we 
assume that one word has length b bits and b = 0{log{mns/e)). 

1. Local Column Sampling: 0{sk<l)) words. 

(a) - 

(b) 0{sk(l)) elements of A. 

2. Global Column Sampling:0(sA:(;/)) words. 

(a) - 

(b) 0{sk(l)) elements of A. 

3. Adaptive Sampling: 0{s + 4>k/e) words. 

(a) s integers each of which is representable with 0(log /c+log \og{mns/e)) (from Lemma 61). 

(b) s integers (the f/s) each with magnitude at most n, hence representable with b bits. 

(c) 0((/>/c/e) elements of A. 

(d) - 

4. Best rank-A: matrix in the span of C: 0{s(j)k/s + sk^e~^) words. 

(a) 0{sk(j)e~^) elements of A. 

(b) 0{sk^e~^) numbers each of which can be represented with b bits. 

(c) 0{sk^£~^) numbers each of which can be represented with b bits. 

In total the communication complexity is 0[sk4>le + sk^ je^) words. 
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10 Lower bounds 


Only in Section 10.1, we discuss the result in arbitrary partition model. In Section 10.2, Section 10.3 
and Section 10.4, we discuss results in column partition model which provides stronger lower 
bounds than in arbitrary partition model. 

10.1 Dependence on bits for distributed PCA in arbitrary partition model 

Lemma 75 (Theorem 8 in [54]). There are two machines, and each of them hold a matrix Ai G 

Let A = Ai + A 2 . Suppose < 2n, the communication of computing a rank-k matrix U with constant 

probability such that 

IIA - UU'^AIll < (1 + e) • IIA - Afclll 

needs n(l/e^) bits. 

Proof (from [54]). The reduction is from GHD problem. 

Lemma 76 (Theorem 1.1 in [20]). Each of two machines has an n-bit vector. There is a promise: either 
the inputs are at Hamming distance less than n/2 — cy/n or greater than nj2 + Cy/n. If they want to 
distinguish these two cases with constant probability, the communication required Tl{n) bits. 

Without loss of generality, we assume 1/e^ is an integer, and machine 1 and machine 2 have 
X, y G {—1,1}^^^ respectively. There is a promise that either x^y < —2/e or x^y > 2/e holds. 
Consider about the following protocol: 

1. Machine 1 constructs Ai G R(^+i)^(^/^^+^), and machine 2 constructs A 2 G 


Ai = 


/ x^e 

0 

0 


V 0 


0 

V2 

0 


0 

0_ 

y/2{l+e) 


0 \ 

0 

0 

y/2{l+e) 

e / 


A 2 


/ yTe 0 0 ... 0 \ 

0 0 0 ... 0 

0 0 0 0 0 


\ 0 0 0 ... 0 / 


2. The server computes U such that 

IIA - UU'^AIll < (1 + e) • IIA - Afclll 
and sends U to both of two machines, where A = Ai + A 2 . 

3. LetP = Ifc+i — UU^. Each machine construct v = II2 

4. Each machine checks whether vf < 1(1 + e). If yes, return the case x^y > 2/e. Otherwise 
return the case x^y < 2/e 

Suppose the server successfully computes U. Since the rank of A is at most k + 1: 

||A-UU^A||| = ||PA||| 

= l|vA||2 

= vAA^v^ 
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An observation is that 



/ 

llx + yllie^ 

0 

0 

0 

\ 



0 

2 

0 

0 


AA^ = 


0 

0 

2(1+£) 

0 



V 

0 

0 

0 

2(1+£) 

••• 

/ 


We have || A — Ak\\p < 2. Then, 


2(1 + e) 


2(1+ £) 




i=3 


(1 - V? - vi) 


< 2(1 +£) 


We have vf + v^ > 1 — e^. When > 2/e, ||x + > 2 + 4e 


(1 + e)|| A — Afcllp — 2(l + e) 

> v^(2 + 4e) + 2 v2 

> 2 - 2^2 + 4ev? 


So, vf < ^(1 + e). Whenx'^y < —2/e, ||x + y\\le‘^ < 2 — 4e 

(1 + e)|| A — Afclll = ||x + y|| 2 e^(l + e) 

> vfllx + yllle^+ 2v| 

= 2(v? + v|) - (2 - ||x + y||ie^)vf 

> 2(1-e^) - (2-||x + y||ie^)v? 


So, 


vl > 


2(l-,2)-||x + y| 




2-||x + y||ie2 


Therefore, machines can distinguish these two cases. The only communication is computing U. 
This cost shoulde be the same as n(l/e^) bits lower bound of the gap hamming distance problem. 


10.2 Lower bounds for distributed PCA on dense matrices 

This section provides a communication cost lower bound for the Distributed PCA problem of 
Definition 17. Specifically, we describe the construction of an m x n matrix A (hard instance), and 
formally argue fhaf for this A, any k < 0.99m, and any error parameter C with 1 < C < poly{skm), 
if there exists some algorithm to construct anm x k matrix U such that, with constant probability, 
||A-UU^A III < C • IIA — Afclll, then this algorithm has communication cost Q.{skm) words. 

10.2.1 Preliminaries 

We use the notation Gk,m to denote the set of /c-dimensional subspaces of M™, which we identify 
with corresponding projector matrices Q G with rank(Q) = k < m, onto the subspaces, 

i.e., 

Gk,m = {Q G = Q,rank(Q) = k < m.} 
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We also need a description of a subset from Gk,m- 

C|(P) = {QGGfc,^:||P-Q||2<<5}. 

This is the set of those projectors Q G Gk,m which differ in operator norm by at most (5 from 
another fixed projector P G Gk,m- In our analysis below we also need a result from [38]: 

Theorem 77. (Net Bound-Corollary 5.1 of [38]) For any m,k, 5 > 0, there is a family J\f = {P^,...,P^}, 
^ ^ 2nik(m-k)iog(i/5))^ pi g Cf(P*) n C^iF^) = for all i f j. 

Theorem 77 proves the existence of a large, but finite, set M of projection matrices, such that if 
one considers fhe ball of matrices of operator norm at most <5 centered at each matrix in M, then 
these balls are disjoint. Our hard instance matrix A is constructed from some member in M. 


10.2.2 Hard instance construction 


Recall that in the column-partition model of Definition 15 there are s servers holding matrices 
Ai,..., As, respectively, where Aj has m rows and some subset of Wi columns of some m x n 
matrix A. Notice that 'ffwi = n. First of all, for arbitrary m, s we assume^ that n > sm. Below, we 
describe a specific construction for a matrix A. 

First of all, we set 5 = 1/(skm), the parameter to be used in Theorem 77. Next, fix a family M 
of subspaces of ^ with the guarantees of Theorem 77. Let Pr = RR^ be a uniformly random 
member of M, where R G has orthonormal columns (any projector can be expressed as 

such a matrix, where the columns of R span the subspace that P projects onto). The entries of 
Ai are equal to the entries of R, each rounded to the nearest integer multiple of 1/B, where 
B = poly(s/cm) is a large enough parameter specified later. We denote this rounded matrix with 

R G So, if Op- is the (f, j)th entry in Ai, then aij = with kmin = argmin \Rij — -g|. Each 

kei 

Ai for i = 2, 3,... , s — 1 is 1/R times the m x m identity matrix. Finally, A^ is the m x t matrix of 
all zeros with t = n — (s — l)m — k. I.e., 


A = 


/■p J^T J^T 




10.2.3 Intermediate results 


First, we give a bound regarding the best rank k approximation for the above matrix A. 
Lemma 78. For the matrix A G in Section 10.2.2, and any k < 0.99m : || A — Afc||| < 
Proof. We have. 


A — A;;. Up < 


< 


< 


||A - AiAtAIII 

^||(I-AiA1)A.||2 

i>l 


i>l 


(s-2) 


m 


sm 


^As, otherwise we can choose a value s' < s so that n — m < s'm < n and apply the argument of Theorem 82 with 
s replaced with s'. Then, the lower bound in Theorem 82 will then be fi{s'mk), which assuming n > 2d, is an fi{n) 
communication lower bound 
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where the first inequality uses the fact that Ai A| is a matrix of rank af mosf k, fhe first equality 
uses the fact that (I — Ai a|) Aj is the all-zeros matrix, the second inequality uses that a projector 
cannot increase a unitarily invariant norm, and the third inequality follows by construction. ■ 

Next, we bound, in the operator norm, the difference of Ai from the matrix R. The lemma 
follows from the fact that Ai G ^nixk jg obtained by R G by rounding each entry to the 

nearest integer multiple of 1/R, and then summing the squared differences across all km entries. 

Lemma 79. (Precision Lemma) ||Ai — R ||2 < 

Proof. ||Ai - R||2 < ||Ai - R||2 <!^. ■ 

Next, we prove a pure linear algebraic result. The following lemma captures the fact that if 
some P G ^ is close to QP (in Frobenius norm) for Q G Gk^m, then P is also close to Q (in 
Frobenius norm). We will use this lemma later for P = P^. 

Lemma 80. (Error Measure) Let P, Q G Gk,m ||P — QP||| < A. Then ||P — Q||| < 2A. 

Proof. By the matrix Pythagorean theorem, 

fc=||Q||| = ||QP||| + ||Q(I-P)|||. (19) 

Also by the matrix Pythagorean theorem, 

A:= ||P||2 = ||QP||2 +||(i_Q)p||2^ 

and so using the premise of the lemma, ||QP||p > k — A. Combining with (19), 

||Q(I-P)|||<A. (20) 

Hence, 

||P-Q||2 = ||(P-Q)p||2+||(p_q)(i_p)||2 

= ||P-QP||| + ||Q-QP||| 

< A + ||Q(I-P)||2 

< 2A, 

where the first equality follows by fhe matrix Pythagorean theorem, the second equality uses 
P^ = P (since P is a projector), the third equality uses the bound in the premise of the lemma, and 
the fourth inequality uses (20). ■ 


10.2.4 Main Argument 


Before presenting the main theorem, we give an intermediate technical lemma. 


Lemma 81. (Implication of Correctness) Let A G be the hard instance matrix in Section 10.2.2 and 
let A be distributed in s machines in the way it was described in Section 10.2.2: 


A = 


(vi At At 

S'*-™ 


j^I-m Omxt) • 


Suppose n = fl(sm), k < rank(A), and 1 < C < poly(skm). A is revealed to the machines by just 
describing the entries in A and without specifying any other detail regarding the construction. Given 
this A and assuming further that each machine knows some projector matrix Q G with rank at 

most k such that I|A-QA||f < CIIA — A^IIf, there is a deterministic algorithm (protocol) which, upon 
termination, leaves on each machine the matrix R G which was used to construct Ai. 
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Proof. First, we prove that ||RR^ — QII 2 < Then, using this bound, we describe a protocol 

that deterministically reveals the matrix R which was used to construct Ai. 

Using Lemma 78 and the premise in the lemma, it follows that: || A — QA||| < C^, which in 
particular implies that 


|Ai - QA 


lIlF ^ 


sm 


We further manipulate the term || Ai — QAi||f as follows: 


Ai — QAiIIf = 


> 

> 

> 


II(i-Q)Ai||f 

11(1 - Q)RRT + (I - Q)(Ai - RRT)||f 
11(1 - Q)RR^||f - ||(I - Q)(Ai - RRT)||f 
11(1 - Q)RRT||f - III - QIIfIKAi - RRT)||2 

11(1 - Q)RR^||f - + Vk), 


( 21 ) 


where the first inequality is the triangle inequality, the second inequality uses sub-multiplicativity, 
and the third inequality uses the triangle inequality, i.e., that ||I — Q||f < ||I||f + ||Q||f/ and Lemma 
79. Combining this bound with (21), it follows that || (I — Q)RR^ ||f < point 

we would like to apply Lemma 80 to conclude that ||RR^ — Q||f is small. To do so, we need RR^ 
and Q to be of rank k. While R has rank k by construction, Q may not. However, increasing the 
rank of Q cannot increase the error || A—QA||f. Hence, if rank(Q) < k, given Q, the protocol in the 
premise of the lemma allows each server to locally add standard basis vectors to the column space 
of Q until the column space of Q becomes k. This involves no communication and each server 
ends up with the same new setting of Q, which for simplicity we still denote with Q. Therefore, 
rank{Q) = k. Applying Lemma 80, it now follows that 

II T ^11 iiT-kT-vT ^11 CSTfl ‘2Tn'\/~k\ 

||RR^ — QII 2 < ||RR^ — QIIf < ( — — -1- — — j . 


By setting R to be a sufficiently large poly (s/cm), we have ||RR^ — Q II 2 < 27i7i- 

Therefore, given Q, by Theorem 77 there is a unique matrix P* in Af for which Q € C‘(P-), 
with S = Indeed, since ||R — QII 2 < < <5/2, there cannot be two matrices P* and P-^ for 

i f j for which Q G C'|(P*) and Q G C'|(p5) since then by the triangle inequality ||P* — P-^ II 2 < <5, 
contradicting the construction of M. Hence, it follows that each machine can deterministically 
identify this P*, by enumeration over M, and therefore compute R, via, for example, an SVD. ■ 

We are now fully equipped to present the main argument about the communication cost lower 
bound for distributed PCA. 


Theorem 82. (Main) Let A G he the hard instance matrix in Section 10.2.2 and let A be distributed 
in s machines in the way it was described in Section 10.2.2: 


A = (R 


J-T 


J_T 


J-T 


^mxt 


)• 


Suppose n = U(sm), k < .99m, and 1 < C < poly{skm). Assume that there is an algorithm (protocol) 
which succeeds with probability at least 2/3 in having the i-th server output QA*,/or all i, where Q G 
j^mxm j-g ^ projector matrix with rank at most k such that ||A — QA||f < UHA — A^Hf. Then, this 
algorithm requires Ll{skmlog{skm)) bits of communication. 
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Further, the bound holds even if the input matrices Aj to the servers have all entries which are integer 
multiples ofl/B,for a value B = poly{skm), and bounded in magnitude by 1, and therefore all entries can 
be specified with 0{log{skm)) bits. 

Note that assuming a word size of Q{log{skm)) bits, we obtain an Fl{skm) word lower bound. 

Proof. If the protocol succeeds, then, since each server outputs QAj, and Aj is 1/B times the 
identity matrix for all i = 2,3, ...s — 1, each of those servers can compute Q. One of those servers 
can now send this Q to the first and the last server. It follows by Lemma 81, that each server can 
identify Pr and R. 

Letting £ be the event that the protocol succeeds, and letting 11* be the ordered sequence of all 
incoming and outgoing messages at the i-th server, it follows that 

/(n*; Q I = fl(log |AA|) = Q{kmlog{smk)), 

where/(X;y \iF) = H{X \ F)—H{X \ Y,F) is the mutual information between random variables 
X and Y conditioned on F. To see the first equality in the above, observe that I (11*; Q | = R(Q | 

£) — H{Q I n*,T), and conditioned on £, there is a uniquely determined matrix Pr each server 
identifies, which is uniform over a set of size |AA|, and so iT(Q | £) > R(Pr) = fl(log 2 \Af\). 

Here, H{X) is the Shannon entropy of a random variable X, given hy H{X) = X^^-PrfA = 
x]log(l/Pr[A = x\), and H{X \ Y) = = y\H{X \Y = y) is the conditional Sharmon 

entropy. 

Hence, if Z is an indicator random variable which is 1 if and only if £ occurs, then using that 
I{X;Y) > I{X;Y \ W) — H{W) for any random variables X, Y, and W, and that I{X;Y \ W) = 
Pr [IP = w\I{X-,Y I W = w), we have 

/(n*;Q) > /(n*;Q|Z)-l 

> /(n*;Q I Z = l)Pr[Z = 1] - 1 
= £L{km\og{smk)), 

mapping X to 11*, Y to Q, and IP to Z in the mutual information bound above. It follows that 

a{km\og{smk)) < /(tf ;Q) < H{YP) < Ep^,,a,,rf[|n*|], 

where |n*| denotes the length, in bits, of the sequence 11*, and rand is the concatenation of the 
private randomness of all s players Note that the entropy of 11* is a lower bound on the ex¬ 
pected encoding length of |n*|, by the Shannon coding theorem. By linearity of expectation, 
Er ,.and||n*| = £l{skm log{smk)), which implies there exists an R and setting of rand for which 
- ||n*| = £l{skm log{smk)). It follows that the total communication is £l{smk log{smk)) bits. ■ 

10.3 Lower bounds for distributed PCA on sparse matrices 

Applying the previous theorem with m = f gives a communication lower bound for sparse ma¬ 
trices. 

Corollary 83. Let A G be the hard instance matrix in Section 10.2.2 (with m replaced with f in 
noting the number of rows in A): 

(R ' b ^ 4 ' ' 3 ^ 4 ’ ‘ ‘ ‘ ' 3^4 0(^xt) • 

Suppose n = £l{s(f)), k < .99(j), and 1 < C < poly{sk(j)). Now, for arbitrary m consider the matrix 
A G lohich has A in the top part and the all-zeros (m — f) xn matrix in the bottom part. A admits 
the same column partition as A after padding with zeros: 
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A = 


R 


It 1 T 1 T n 

B^<t> bH ••• 

^(m—<f>)xk ^{m—if))xif) ■ ■ ■ ^{m—if))xif) ^(m—if))xt 

We denote the new sub-matrices with A* G 

Assume that there is an algorithm (protocol) which succeeds with probability at least 2)3 in having the 
i-th server output QAi, for all i, where Q G jg ^ projector matrix with rank at most k such that 

||A- QA||f < C||A — Afcllp. Then, this algorithm requires Tl{sk(f)\og{sk(j))) bits of communication. 
Note that assuming a word size ofQ{log{sk(f))) bits, we obtain an Tl(sk(j)) word lower bound. 

10.4 Lower bounds for distributed column-based matrix reconstruction 

In this section we develop a communication lower bound for the Distributed Column Subset Se¬ 
lection Problem of Definition 19. 

10.4.1 A Net of discretized Deshpande-Vempala hard instances 

We start by describing the construction of a special sparse matrix from the work of Desphande and 
Vempala [27], which was further studied by [17]. This matrix gives a lower bound (on the number 
of columns need to be selected in order to achieve a certain accuracy) for the standard column 
subset selection problem (see the following Fact). Suppose f > 2k/e. Consider ajf + l) xf matrix 
B for which the f-th column is ei -|- Cj+i, where e* is the i-th standard basis vector in Now 

define an (((/> -|- l)k) x (fk) mafrix A with k blocks along the diagonal, where each block is equal 
to the matrix B. Let n = {(j) + l)k. 

Fact 84. (Proposition 4 in [27], extended in Theorem 37 of [17]) Let A be as above. IfP is annxn projection 
operator onto any subset of at most kl(2e) columns of A, then ||A — PA||| > (1 -I- e)||A — Afc|||. 

To prove a communication lower bound for distribufed column subsef selection, we will need 
to transform the above matrix A by multiplying by a rounded orthonormal matrix, as follows. We 
also need fo prove a similar lower bound as in the previous fact for this transformed matrix. 

Lemma 85. Let A, k, f, 1/e, and n be as above. For any n x n orthonormal matrix L, let L denote 
the matrix formed by rounding each of the entries o/L to the nearest integer multiple of M, where M < 
1/iif + l)k)‘^,for a sufficiently large constant c > 0. IfP is annx n projection operator onto any subset 
of at most /c/(6e) columns o/LA, then ||LA — PLA||| > (1 -I- e)||LA — [LA]fc|||. 

Proof. We will show that, assuming 1/e < n/3, if P is an n x n projection operator onto any subset 
of at most k){2e) columns of LA, then ||LA — PLA||| > (1 -|- e/3)||LA — [LA]fc|||. The lemma will 
then follow by replacing e with 3e. 

Suppose S' is a subset of k) (2e) columns of LA for which 

||LA - PLA||2 < (1 + e/3)||LA - [LA]fc|||, (22) 

where P is the projection onto the columns in S and [LAj^ is the best rank-/c approximation to LA 
The proof sfrategy is to relate the left hand side of (22) to || A — QA||| where Q is the projection 
onto the columns of A indexed by the same index set S', and simultaneously to relate the right 
hand side of (22) to || A — Afc|||. 

We start by looking at ||LA — [LAJ^Hf. Notice that LA = LA-l-EA, where E is a matrix whose 
entries are all bounded in magnitude by M. Therefore, LA = LA -|- F, where F = EA and 

||F||2 < ||E||f||A||f < • (2(n - 1))^/^ < {2n^M‘^y^'^ , 
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where we have used II A||f < (2(n — 1))^'^^. By Weyl's inequality (Corollary 7.3.8 of [34]), for all 

i < (pk, 

|cJi(LA) - aiiLA + F)| < ||F||2 < (271^)^/^ , 

where iTj(X) denotes the i-th singular value of a matrix X. Since L is orthonormal, cr* (LA) = a* (A) 
for all i. It follows that 

||LA-[LA]fc||| = j;a2(LA) 

i>k 

i>k 

= Y.[a,iA)±{2n^MY"y 

i>k 

= ||A - Afclll ± 0(n||A||2 + n^M^) 

= ||A-Afc|||±0(l/n2), (23) 

where the final line follows for a sufficiently small choice of M < 1 Irf, using that ||A ||2 < ||A||f = 

We now look at ||LA — PLA||f. By the triangle inequality, 

||LA-PLA||f > ||LA-PLA||f - ||EA||f - ||PEA||f 

> ||EA — PLA||f ~ ||E||f||A||f ~ ||P||2||E||f||A||f 
= ||A-LTpLA||F-0(l/n3), (24) 

where the equality uses that multiplying by preserves norms, as well as that ||E||f < l/poly(n) 
for an arbitrarily small poly(n) by making M < 1/rf sufficiently small, whereas ||P ||2 < 1 and 
||A||f = 0{y/n). We claim L^^PL = Q, where Q is the projection onto the columns in A indexed 
by the set S. To see this, let U be an orthonormal basis for the columns in A spanned by the index 
set S, so that P = LUU^L^. Then L^PL = UU^ = Q. Hence, using that || A — QA||f < || A||f = 
0{^/n) and squaring (24), we have, 

||LA - PLAIll > ||A - QA||| - 0(1^). (25) 

Combining (22), (23), and (25), 

||A - QA||| < (1 + e/3)||A - Afc||| + 0(1^). (26) 

Note that || A — A^Hf > 1 for any k < n, since if we remove the top row of A, creating a matrix B, 
then ||B — B^Hf < || A — A^Hf, yet B is the identity matrix, and so ||B — B^Hf > 1. It follows that 
the additive 0(l/n^) error in (26) translates into a relative error, provided k < n and for 1/e < n/3. 
Hence, || A — QA||| < (1 + e)|| A — Afc|||, and therefore by Fact 84, we have IS"! > k/{2e). ■ 

10.4.2 Column sharing of discretized orthonormal matrices 

We need the following technical lemma about sampling random orthonormal matrices, concern¬ 
ing the number of columns they have in common after rounding. 
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Lemma 86. Let 1 <r < n, and let M < Ijrf for a sufficiently large absolute constant c > 0. Suppose we 
choose two independently random orthogonal matrices A G and B G from the Haar measure 
(so they each have orthonormal columns and orthonormal rows). We then round each of the entries in A to 
the nearest integer multiple of M, obtaining a matrix A, and similarly obtain B. Let £ denote the event that 
there exist r distinct indices ii,... ,ir G [n] and r distinct indices ji, .. ., jV £ [n]for which A*^j^ = B*j^ 
for all £ = 1,2, ... ,r, where A^,i^ denotes the i^-th column of A. Then, Pr[£^] < ^ 

Proof. We fix a subset 5 of r distinct indices ii,. .. ,ir G [n] and a subset T of r disfinct indices 
jij - ■ ■ ijr £ [n] and show that the probability A*^j^ = B* simultaneously for all £ G [r] is at most 
^-e(nr log M) j|. follows by a union bound that 

—©(nrlogM) ^ ^ ne ^ log M) ^ ^—Q{nr\ogM) 


Pr[£:] < 


n 


as desired. 

Since A and B are independent and their distribution is permutation-invarianp we can assume 
H = je = ^ for £ G [r]. Let (F be the event that A*^i^ = B* occurs for all £ ^ [r]. \i F occurs, fhen 
we need || A*^i^ — B*j^||oo < M for all £ G [r], which implies || A*^i^ — B*j ^||2 < ffnM for all £ G [r]. 
Let A*" be the leftmost r columns of A, and B^ the leftmost r columns of B. Then, if F occurs 

r 

||A^ - B^li < ||A^ - BHlI = < rnM^. 

e=i 

Note that A'" and B^ have orthonormal columns, and so Pa'’ = (A'’)(A^)^ and Pb’’ = (B'’)(B'’)^ 
are projection matrices. Then, 

IIA" - > ||(A")(A")T - B"(A")T||2/2 + ||(A")(B")T - (B")(B'')T||2/2 

= ||(A'-)(A'')T - (An(B^)^||2/2 + ||(A'')(Bn^ - (Bn(B^)^||2/2 

> ||(A'')(A'')T-(B^)(B^)T||2/2, (27) 

where the first inequality follows since multiplying by a matrix on the right with orthonormal 
rows cannot increase the spectral norm, the equality follows by taking transposes, and the second 
inequality follows by fhe triangle inequality. 

We need another result from [38]. 

Theorem 87. (Probability Bound - Claim 5.2 of [38]) Suppose we choose two independent random sub¬ 
spaces Y and S o/M"' each of dimension r from the Haar measure, where we denote the projection operators 
onto the subspaces by YY"*^ and SS^, respectively, where Y and S have r orthonormal columns. Then for 
any 5^ (0,1), PrjUYY^ - SS '^||2 < 5] < 

Combining (27) with Theorem 87, we have that for r < n/2, Pr[J^] < g,-^pn\og(M))^ since by 
taking M < 1/rf sufficiently small, we can ensure ||A^ — B ’’||2 < ffrnM < and so 2M^/^ 

upper bounds ||(A’')(A^)^ — (B’’)(B^)^|| 2 . 

Note that if r > n/2, then in particular we still have that A* = B* for all £ G [r/2], and 
therefore in this case we also have Pr[J^] < g-^pniogM)^ Hence, Pr[£^] < g-^pniogM)^ which 
completes the proof. ■ 

We now extend Lemma 86 to a /c-fold version. 
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Lemma 88. Let M < 1/rf for a sufficiently large absolute constant c > 0. Suppose we independently 
choose k pairs of matrices A^,B^ G z g [k], each with orthonormal columns (and hence also 

orthonormal rows). We then round each of the entries in each and to the nearest integer multiple 
of M, obtaining matrices A^, and B^. Let r = (ri,..., r^) G (Z-°)^ and let £ be the event that for each 
z G [k], there exist distinct indices ii,..., G [n] and ji,... ,jrz G [n]for which = B^^j^for all 

e = l,2,...,r^. Then,FT[£] < 

Proof This follows by Lemma 86, and the fact that the matrices A^,..., A^, B^,..., B^ are jointly 
independent. Here we also use a union bound. ■ 

Corollary 89. Suppose 1 < k < confor an absolute constant cq > 0. Let r = (ri,..., r^) G Let 

M < 1/ n^for a sufficiently large constant c > 0. There exists a set J\f ofe^^^^ k-tuples (Ai,..., A^) 
of matrices each of dimensions nxn with entries that are integer multiples ofM and bounded in magnitude 
by 1, such that for every vector r = (ri,..., r^) G (Z-°)^ with holds that for all distinct k- 

tuples (Ai,..., Afc), (Bi,..., Bfc), there exists a z £ [k]for which there are no sets S' = {ii,..., T = 
{ji, • • •, jz} C [n] for which A*,i^ = B* for all £ = 1,2,, r^. 

Proof. For a fixed vector r and a random choice of (A^,..., A^), (B^,..., B^), by fhe guarantee of 
Lemma 88, the complement of the event in the statement of the corollary happens with probability 
at most number of vectors r with Yl!l=i ^2 < Hs at most ff, and < 

g-e(ntiogM) provided k < con for an absolute constant cq > 0, so we can apply a union bound 
over all pairs in a random choice of A:-tuples. ■ 

10.4.3 Hard instance construction 

We have s servers holding matrices Ai,..., A^, respectively, where Aj has m rows and a subset 
of Wi columns of an m x n matrix A, where Jfwi = n. For our lower bound we assume 2/e < f 
and kf < m. Note that for e less than a sufficiently small constant, this implies that k < cokf, and 
so the assumption in Corollary 89 is valid (since the n of that corollary is equal to (k + l)(j) We set 
M = 1/ (mn)'^ for a sufficiently large constant c > 0. 

We will have Ai being an m x /c(0 — 1) matrix for which all but the first kf rows are zero. We 
assume 2/e < f. On the first kf rows, Ai is block diagonal containing k blocks, each block being 
a 0 X (0 — 1) matrix. 

To specify Ai, let the parameters t and n of Corollary 89 equal k/ (2e) and f, respectively. That 
corollary gives us a net M of /c-tuples of matrices each of dimensions f x f wifh 

entries that are integer multiple of M and bounded in magnitude by 1. 

For a random fc-tuple (B^,..., B^) in J\f, we create anmxm block diagonal matrix B, which is 
0 on all but its first kcj) rows and first kf columns. On its first kf rows, it is a block diagonal matrix 
B whose blocks along the diagonal are B^,..., B^, respectively. Our matrix Ai is then the matrix 
product of B and an mafrix D, where D is m x k((j) — 1) is formed by taking the kf x k{(l) — 1) 
matrix of Fact 84 and padding it with m — kf rows which are all zeros. By Lemma 85, no k/(2e) 
columns of Ai contain a A:-dimensional subspace in their span which is a (1 + e)-approximation to 
the best rank-/c approximationt to Ai. 

Each Aj for i = 2,3,...,s — lis fhe mxm matrix of all zeros. Finally, A^ is fhe mxt mafrix of 
all zeros with t = n - (s - l)m - kif - 1), i.e., A = (Ai O^xm O^xm • • • O^xm O^xt) • By 
construction, each column of A has at most f non-zero entries, since the only non-zero columns 
are those in Ai, and each column in Ai has the form B^ -|- B^ ^ for a z £ [k] and an i G [</>], so it 
has at most f non-zero entries. 
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10.4.4 Main theorem 

In the Distributed Column Subset Selection Problem in Definition 18, a correct protocol should 
have each of the s machines simultaneously output a matrix C G with c < n columns of A 
such that 

||A-CCtA||| < (l + e)||A-Afclll. 

Theorem 90. Assume ‘Ije < cj) and k(j) < min(m, n). Then, any, possibly randomized, protocol n 
which succeeds with probability at least 2/3 in solving the Distributed Column Subset Selection Problem 
of Definition 18, under the promise that each column of A has at most f non-zero entries which are integer 
multiples ofl/{mnY and bounded in magnitude by I, for a constant c > 0, requires D{s<f)k{log{mn))/e) 
bits of communication. If the word size is 0(log(mn)), this implies the communication cost of the porticol 
is D{s4>k/e) words. 

Proof. By construction of Ai, whenever 11 succeeds, each of the s machines outputs the same set 
C of at least t>k/ (2e) columns of Ai. 

Let n* denote the transcript (sequence of all messages) between the coordinator and the f-th 
player. We look at the information 11* reveals about Ai, given the event £ that n succeeds, that is, 

/(n*;Ai |£:). 

A critical observation is that any column of C, is equal to ^ ^ for some z G [k], 

where B^ is as defined above (i.e., Ai = B • D). Hence, given B^ ^ and one can reconstruct 
B^ j. Now, for any random variable W, 

/(n*; Ai I £:) > /(n*; Ai \£,W)- HiW). (28) 

We let W = (Bj; , B^ ^), and observe that ff(W) < log2((2M)^^) since there are 2M choices 
for each coordinate of each B^ ^. Hence, IP) = 0(kflog(mn)). Note that/(n*; Ai | £,W) = 
iL(Ai) = log 2 (|W|), since conditioned on the protocol succeeding, and given W, by Corollary 89 
we can reconstruct Ai. By (28), it follows that 

(n*; Ai I £) > 0(fk(log(mn))/€) — 0{k(/)log{mn)) = D{(l)k(log{mn))/e). 

Note that if Z is an indicator random variable that indicates that £ occurs, we have H{Z) < 1, and 
so 

/(n*; Ai) > /(n*; Ai | £)Pi[£] - 1 = 52(#(log(mn))/e), 

since Pr[£^] >2/3. 

It follows that 


L!(#(log(mn))/e) < /(B*; Ai) < iL(n*) < EAi,w[|n*|], 

where |n*| denotes the length, in bits, of the sequence 11* of messages, and rand is the concatena¬ 
tion of the private randomness of all s machines. Note that the entropy of 11* is a lower bound on 
the expected encoding length of 111*|, by the Shannon coding theorem. By linearity of expectation, 

EEa. ,mnd|n*| = D(#(log(mn))/e), 
i 

which implies there exists an Ai and setting of rand for which 

|n*| = Q{s(j)k{log(mn))/e). 

i 

It follows that the total communication is Pl(s(j)k{\og{mn)) / e) bits. ■ 
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Corollary 91. Assume ‘Ije < (/> and kcj) < min(m, n). Then, any, possibly randomized, protocol n 
which succeeds with probability at least 2j3 in solving the Distributed Column Subset Selection Problem 
- rank k subspace version (see Definition 19), promise that each column of A has at most f non-zero 
entries which are integer multiples of 1/(mn)^ and bounded in magnitude by 1, for a constant c > 0, 
requires VL{s(l)k{\og{mn)) / e) bits of communication. If the word size is 0(log(mn)), this implies the total 
communication across all s machines is D{s4>k/e) words. 

Proof. As any such protocol is also a protocol for the Distributed Column Subset Selection Prob¬ 
lem of Definition 18 (the problem in Definition 19 is more general/difficult than the problem of 
Definition 18), the proof follows immediately from Theorem 90. ■ 
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